[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857164#action_12857164 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

{quote}
It's for performance. I expect there are apps where a given thread/pool indexes certain kinds of docs, i.e. the app threads themselves have "affinity" for docs with similar term distributions. In that case, it's best (most RAM efficient) if those docs with presumably similar term stats are sent back to the same DW. If you mix different term stats into one buffer you get worse RAM efficiency.
{quote}

I do see your point, but I feel we shouldn't optimize or make compromises for this use case, mainly because apps with the kind of affinity you describe seem very rare. The usual design is a queued ingestion pipeline, where a pool of indexer threads takes docs out of a queue and feeds them to an IndexWriter; in such a world the threads have no affinity for similar docs. And if a user really has such different docs, maybe the right answer is more than one index. Even if an app utilizes thread affinity today, that only yields somewhat faster indexing, and the benefit is lost after flushing/merging.

If we assign docs randomly to the available DocumentsWriterPerThreads, we should on average make good use of the overall memory. Alternatively, we could select the DWPT from the pool that has the largest amount of free memory.

Fully decoupled memory management is compelling, I think, mainly because it makes everything so much simpler: a DWPT can decide itself when it's time to flush, and the other ones keep going independently. If you do have global RAM management, how would flushing work? E.g. when a global flush is triggered because all RAM is consumed, and we pick the DWPT with the largest amount of allocated memory for flushing, what do the other DWPTs do during that flush? Wouldn't we have to pause them to make sure we don't exceed maxRAMBufferSize? Of course we could say "always flush when 90% of the overall memory is consumed", but how would we know that the remaining 10% won't fill up during the time the flush takes?

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
>          Key: LUCENE-2324
>          URL: https://issues.apache.org/jira/browse/LUCENE-2324
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Index
>     Reporter: Michael Busch
>     Assignee: Michael Busch
>     Priority: Minor
>      Fix For: 3.1
>
>  Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
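The trade-off discussed in the comment above can be sketched in code. This is a hypothetical illustration, not Lucene API: a global flush policy that, once total RAM across all thread states crosses a fraction of the budget, picks the state using the most memory to flush while the others keep indexing. The names `ThreadState` and `FlushPolicy` and the 90% trigger are assumptions taken from the discussion.

```java
import java.util.List;

// Hypothetical sketch of the global RAM-management variant under discussion.
class ThreadState {
    final int id;
    long bytesUsed;  // RAM currently buffered by this thread's DWPT
    ThreadState(int id, long bytesUsed) { this.id = id; this.bytesUsed = bytesUsed; }
}

class FlushPolicy {
    private final long ramBudgetBytes;
    private final double triggerFraction;  // e.g. 0.9 = "flush at 90% consumed"

    FlushPolicy(long ramBudgetBytes, double triggerFraction) {
        this.ramBudgetBytes = ramBudgetBytes;
        this.triggerFraction = triggerFraction;
    }

    // Returns the thread state to flush, or null if no flush is needed yet.
    ThreadState pickFlushCandidate(List<ThreadState> states) {
        long total = 0;
        ThreadState biggest = null;
        for (ThreadState s : states) {
            total += s.bytesUsed;
            if (biggest == null || s.bytesUsed > biggest.bytesUsed) biggest = s;
        }
        if (total < (long) (ramBudgetBytes * triggerFraction)) return null;
        return biggest;  // flush the largest buffer; the others keep indexing
    }
}
```

The open question in the comment is exactly the gap this sketch leaves: nothing here stops the remaining threads from filling the last 10% while the chosen DWPT is still flushing.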
[jira] Assigned: (LUCENE-1698) Change backwards-compatibility policy
[ https://issues.apache.org/jira/browse/LUCENE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-1698:
-------------------------------------

    Assignee:     (was: Michael Busch)

:)

> Change backwards-compatibility policy
> -------------------------------------
>
>          Key: LUCENE-1698
>          URL: https://issues.apache.org/jira/browse/LUCENE-1698
>      Project: Lucene - Java
>   Issue Type: Task
>     Reporter: Michael Busch
>     Priority: Minor
>      Fix For: 3.0
>
>
> These proposed changes might still change slightly:
> I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a 'minor release', and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (We can later use different names; these are just for convenience here.)
> 1. The file format backwards-compatibility policy will remain unchanged; i.e. Lucene X.Y supports reading all indexes written with Lucene X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x indexes.
> 2. Deprecated public and protected APIs can be removed if they have been released as deprecated in at least one major or minor release. E.g. a 3.1 API can be released as deprecated in 3.2 and removed in 3.3 or 4.0 (if 4.0 comes after 3.2).
> 3. No public or protected APIs are changed in a bugfix release, except if a severe bug can't be fixed otherwise.
> 4. Each release will have release notes with a new section "Incompatible changes", which lists, as the name says, all changes that break backwards compatibility. The list should also have information about how to convert to the new API. I think the Eclipse releases have such a release-notes section. Furthermore, the deprecation tag comment will state the minimum version when the API is to be removed, e.g.
>    @deprecated See #fooBar(). Will be removed in 3.3
> or
>    @deprecated See #fooBar(). Will be removed in 3.3 or later.
> I'd suggest treating a runtime change like an API change (unless it's fixing a bug, of course), i.e. giving a warning, providing a switch, and switching the default behavior only after a major or minor release was around that had the warning/switch.
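Point 4 of the proposal could look like the following in source. This is only an illustration of the proposed Javadoc convention; the class and method names come from the issue's own `#fooBar()` example and are hypothetical.

```java
// Illustration of the proposed deprecation convention (hypothetical class).
class Example {
    /**
     * Returns the foo value.
     *
     * @deprecated See {@link #fooBar()}. Will be removed in 3.3 or later.
     */
    @Deprecated
    public int foo() {
        return fooBar();
    }

    /** Replacement for the deprecated {@code foo()}. */
    public int fooBar() {
        return 42;
    }
}
```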
[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2324:
----------------------------------

    Attachment: lucene-2324.patch

The patch removes all *PerThread classes downstream of DocumentsWriter. This simplifies a lot of the flushing logic in the different consumers. The patch also removes FreqProxMergeState, because we no longer have to interleave posting lists from different threads. I really like these simplifications!

There is still a lot to do: the changes in DocumentsWriter and IndexWriter are currently just experimental, to make everything compile. Next I will introduce DocumentsWriterPerThread and implement the sequenceID logic (which was discussed here in earlier comments) and the new RAM management. I also want to go through the indexing chain once again - there are probably a few more things to clean up or simplify.

The patch compiles, and a surprising number of tests pass. Only multi-threaded tests seem to fail, which is not very surprising, considering I removed all thread-handling logic from DocumentsWriter. :) So this patch isn't working yet - I just wanted to post my current progress.
[jira] Commented: (LUCENE-1879) Parallel incremental indexing
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855377#action_12855377 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
I'll start by describing the limitations of the current design (whether it's the approach or the code is debatable):
{quote}

FWIW: the attached code and approach were never meant to be committed. I attached the code for legal reasons, as it contains the IP that IBM donated to Apache via the software grant, and Apache requires attaching the code that is covered by such a grant. I wouldn't want the master/slave approach in Lucene core - you can implement it much more nicely *inside* Lucene. The attached code, however, was developed with the requirement of running on top of an unmodified Lucene version.

{quote}
I realized this when I found that if the tests (in this patch) are run with "-ea", there are many assert exceptions printed from IndexWriter.startCommit.
{quote}

The code runs without exceptions with Lucene 2.4. It doesn't work with 2.9/3.0, but you'll find an upgraded version that works with 3.0 within IBM, Shai.

> Parallel incremental indexing
> -----------------------------
>
>          Key: LUCENE-1879
>          URL: https://issues.apache.org/jira/browse/LUCENE-1879
>      Project: Lucene - Java
>   Issue Type: New Feature
>   Components: Index
>     Reporter: Michael Busch
>     Assignee: Michael Busch
>      Fix For: 3.1
>
>  Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature: http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> Discussion on java-dev: http://markmail.org/thread/ql3oxzkob7aqf3jd
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853751#action_12853751 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

Sorry, Jason, I got sidetracked with LUCENE-2329 and other things at work. I'll try to write the sequence-ID stuff asap.

However, there's more we need to do here that is largely independent of the deleted-docs problem, e.g. removing all the downstream *PerThread classes. We should work with the flex code from now on, as the flex branch will be merged into trunk soon.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853509#action_12853509 ]

Michael Busch commented on LUCENE-2329:
---------------------------------------

We could move the if (postingsArray == null) check to start(); then we wouldn't have to check for every new term.

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>          Key: LUCENE-2329
>          URL: https://issues.apache.org/jira/browse/LUCENE-2329
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Index
>     Reporter: Michael Busch
>     Assignee: Michael Busch
>     Priority: Minor
>      Fix For: 3.1
>
>  Attachments: lucene-2329-2.patch, LUCENE-2329.patch, LUCENE-2329.patch, LUCENE-2329.patch, lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling and will remove the overhead of object initialization and garbage collection. Garbage collection especially should benefit significantly when the JVM runs low on memory, because in such a situation the GC mark times can get very long if there is a large number of long-living objects in memory.
> Another benefit could be to build more efficient TermVectors. We could avoid having to store the term string per document in the TermVector and instead just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. We can make that improvement in a separate JIRA issue, though.
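The parallel-array layout described above can be sketched as follows. This is a simplified illustration, not the actual Lucene class (the real one is ParallelPostingsArray): the per-term fields that used to live in a PostingList object each become one array, indexed by the dense termID.

```java
// Simplified sketch: per-term state held in parallel arrays keyed by termID,
// instead of one PostingList object per term. Field names are illustrative.
class ParallelPostings {
    int[] lastDocIDs;  // last docID in which each term occurred
    int[] docFreqs;    // number of documents each term occurred in
    int[] termFreqs;   // total number of occurrences of each term

    ParallelPostings(int numTerms) {
        lastDocIDs = new int[numTerms];
        docFreqs = new int[numTerms];
        termFreqs = new int[numTerms];
        java.util.Arrays.fill(lastDocIDs, -1);  // -1 = term not seen yet
    }

    // Record one occurrence of termID in docID; updates all arrays in lockstep.
    void addOccurrence(int termID, int docID) {
        if (lastDocIDs[termID] != docID) {
            lastDocIDs[termID] = docID;
            docFreqs[termID]++;
        }
        termFreqs[termID]++;
    }
}
```

The GC benefit claimed in the description follows directly: three int arrays are three objects for the collector to mark, regardless of how many terms they hold, whereas one object per term means millions of long-living objects.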
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852858#action_12852858 ]

Michael Busch commented on LUCENE-2329:
---------------------------------------

Thanks! I think we can resolve this now?
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852625#action_12852625 ]

Michael Busch commented on LUCENE-2329:
---------------------------------------

Looks great! I like the removal of bytesAlloc - a nice simplification.
[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2329:
----------------------------------

    Attachment: lucene-2329-2.patch

This patch:
* Changes DocumentsWriter to trigger the flush using bytesAllocated instead of bytesUsed, to improve the "running hot" issue Mike's seeing
* Improves ParallelPostingsArray to grow using ArrayUtil.oversize()

In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in freeRAM(), but that involves tricky thread-safety changes, because one thread could call DocumentsWriter.balanceRAM(), which triggers freeRAM() across *all* thread states, while other threads keep indexing. We decided to leave it the way it currently works: we discard the whole parallel array during flush and don't reuse it. This is not as optimal as it could be, but once LUCENE-2324 is done this won't be an issue anymore anyway.

Note that this new patch is against the flex branch: I thought we'd switch over soon anyway? I can also create a patch for trunk if that's preferred.
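The oversized-growth idea mentioned in the patch notes amortizes array reallocation: instead of growing a full array by one slot, grow it by a fraction so that repeated appends cost amortized constant time. The sketch below mimics the intent of ArrayUtil.oversize() with a simple ~1.5x rule; Lucene's real sizing logic differs in its details.

```java
import java.util.Arrays;

class Grow {
    // Grow `array` to hold at least `minSize` entries, over-allocating by
    // ~1.5x so successive grow() calls don't copy on every append.
    static int[] grow(int[] array, int minSize) {
        if (array.length >= minSize) {
            return array;  // already big enough; no copy
        }
        int newSize = Math.max(minSize, array.length + (array.length >> 1));
        return Arrays.copyOf(array, newSize);
    }
}
```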
[jira] Resolved: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-2126.
-----------------------------------

    Resolution: Fixed

Committed revision 929340.

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -----------------------------------------------------------------
>
>              Key: LUCENE-2126
>              URL: https://issues.apache.org/jira/browse/LUCENE-2126
>          Project: Lucene - Java
>       Issue Type: Improvement
> Affects Versions: Flex Branch
>         Reporter: Michael Busch
>         Assignee: Michael Busch
>         Priority: Minor
>          Fix For: Flex Branch
>
>      Attachments: lucene-2126.patch, lucene-2126.patch
>
>
> I'd like to introduce two new classes, DataInput and DataOutput, that contain all methods from IndexInput and IndexOutput that actually decode or encode data, such as readByte()/writeByte() and readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not related to data encoding but to files as input/output sources, stay in IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend DataInput/DataOutput. Previously ByteSliceReader implemented the methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.
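The split described in the issue can be sketched as a two-level hierarchy: encoding/decoding primitives in the base class, file-positioning concerns one level down. The signatures below are simplified, not the exact Lucene API; the VInt decoder follows the standard 7-bits-per-byte, high-bit-continuation format Lucene uses.

```java
// Sketch of the proposed split (simplified signatures).
abstract class DataInput {
    public abstract byte readByte();

    // Variable-length int: 7 payload bits per byte, high bit set means
    // "more bytes follow" - the format behind Lucene's readVInt().
    public int readVInt() {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// File-positioning methods stay here, as in the issue description.
abstract class IndexInput extends DataInput {
    public abstract long getFilePointer();
    public abstract void seek(long pos);
    public abstract void close();
}

// Minimal concrete decoder over a byte[], for demonstration only: it can
// extend DataInput directly, without pretending to be a file.
class ByteArrayDataInput extends DataInput {
    private final byte[] bytes;
    private int pos;
    ByteArrayDataInput(byte[] bytes) { this.bytes = bytes; }
    public byte readByte() { return bytes[pos++]; }
}
```

This is exactly the benefit the patch claims for ByteSliceReader: a pure in-memory decoder no longer has to stub out seek() and close() with RuntimeExceptions.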
[jira] Commented: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851452#action_12851452 ]

Michael Busch commented on LUCENE-2111:
---------------------------------------

bq. Flex is generally faster.

Awesome work! What changes make those queries run faster with the default codec? Mostly the terms-dict changes and automaton for fuzzy/wildcard? How's the indexing performance?

bq. I think net/net we are good to land flex!

+1! Even if there are still small things to change/fix, I think it makes sense to merge with trunk now.

> Wrapup flexible indexing
> ------------------------
>
>              Key: LUCENE-2111
>              URL: https://issues.apache.org/jira/browse/LUCENE-2111
>          Project: Lucene - Java
>       Issue Type: Improvement
>       Components: Index
> Affects Versions: Flex Branch
>         Reporter: Michael McCandless
>         Assignee: Michael McCandless
>          Fix For: 3.1
>
>      Attachments: benchUtil.py, flex_backwards_merge_912395.patch, flex_merge_916543.patch, flexBench.py, LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, LUCENE-2111_mtqNull.patch, LUCENE-2111_mtqTest.patch, LUCENE-2111_toString.patch
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search performance testing looks good, and it survived several visits from the Unicode policeman ;)
> But it still has a number of nocommits, could use some more scrutiny, especially on the "emulate old API on flex index" and vice-versa code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self-contained fixes. The end is in sight!
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851451#action_12851451 ]

Michael Busch commented on LUCENE-2126:
---------------------------------------

I'll try to commit to flex tonight, but it'll probably be tomorrow (I think I have to update the patch, because there were some changes to IndexInput/Output). If you want to merge flex into trunk sooner, I can also just commit this afterwards to trunk.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851142#action_12851142 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

{quote}
To clarify: the "apply deletes up to" doc ID will be the flushed doc count saved per term/query per DW, though it won't be saved; it'll be derived from the sequence-ID int array, where the action has been encoded into the seq-ID int?
{quote}

Yeah, that's the idea. Let's see if it works :)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851078#action_12851078 ]

Michael Busch commented on LUCENE-2324:
---------------------------------------

{quote}
I'm not sure we need that level of complexity just yet? How would we make the transaction log memory efficient?
{quote}

Is that really so complex? You only need one additional int per doc in the DWPTs, plus the global map for the delete terms; you don't need to buffer the actual terms per DWPT. I thought that's quite efficient? But I'm totally open to other ideas. I can try tonight to code a prototype of this - I don't think it would be very complex, actually, though of course there might be complications I haven't thought of.

bq. Are there other uses you foresee?

Not really for the "transaction log", as you called it; I'd remove that log once we switch to deletes in the foreground (when the RAM buffer is searchable). But a nice thing would be for add/update/delete to return the seqID, and for the RAMReader in the future to have an API to check up to which seqID it's able to "see". Then it's very clear to a user of the API where a given reader is at. For this to work we have to assign the seqID at the *end* of a call. E.g. when adding a large document, which takes a long time to process, it should get the seqID assigned after the "work" is done and right before the addDocument() call returns.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850989#action_12850989 ] Michael Busch commented on LUCENE-2329: --- Good catch! Thanks for the thorough explanation and suggestions. I think it all makes sense. Will work on a patch. > Use parallel arrays instead of PostingList objects > -- > > Key: LUCENE-2329 > URL: https://issues.apache.org/jira/browse/LUCENE-2329 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch > > > This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. > In order to avoid having very many long-living PostingList objects in > TermsHashPerField we want to switch to parallel arrays. The termsHash will > simply be an int[] which maps each term to dense termIDs. > All data that the PostingList classes currently hold will then be placed in > parallel arrays, where the termID is the index into the arrays. This avoids > the need for object pooling and removes the overhead of object > initialization and garbage collection. Garbage collection especially should > benefit significantly when the JVM runs low on memory, because in such a > situation the gc mark times can get very long if there is a large number of > long-living objects in memory. > Another benefit could be to build more efficient TermVectors. We could avoid > having to store the term string per document in the TermVector. > Instead we could just store the segment-wide termIDs. This would reduce the > size and also make it easier to implement efficient algorithms that use > TermVectors, because no term mapping across documents in a segment would be > necessary. This improvement can be made in a separate jira issue, though.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850792#action_12850792 ] Michael Busch commented on LUCENE-2324: --- {quote} However, in the apply deletes method how would we know which doc to stop deleting at? How would the seq id map to a DW's doc id? {quote} We could have a global deletes-map that stores seqID -> DeleteAction. A DeleteAction contains either a Term or a Query, plus an int "flushCount" (I'll explain in a bit what flushCount is used for). Each DocumentsWriterPerThread would have a growing array that contains each seqID that "affected" that DWPT, i.e. the seqIDs of *all* deletes, plus the seqIDs of the adds/updates performed by that particular DWPT. One bit of a seqID in that array can indicate whether it's a delete or an add/update. When it's time to flush we sort the array by increasing seqID and then loop a single time through it to find the seqIDs of all DeleteActions. During the loop we count the number of adds/updates to determine how many docs each DeleteAction affects. After applying the deletes the DWPT makes a synchronized call to the global deletes-map and increments the flushCount int for each applied DeleteAction. If flushCount == numThreadStates (== number of DWPT instances) the corresponding DeleteAction entry can be removed, because it was applied to all DWPTs. I think this should work? Or is there a simpler solution?
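The bookkeeping described in the comment above can be sketched roughly as follows. This is a minimal illustration only, not the actual Lucene implementation: the names DeleteAction, flushCount, and numThreadStates come from the comment, while everything else (class names, the low-bit encoding, the map layout) is a hypothetical stand-in.

```java
import java.util.*;

// Sketch of the proposed global deletes-map plus per-DWPT seqID log.
class DeletesMapSketch {
    static final int NUM_THREAD_STATES = 2;       // number of DWPT instances (assumed)

    static class DeleteAction {
        final String term;                        // the comment also allows a Query
        int flushCount = 0;                       // how many DWPTs have applied it
        DeleteAction(String term) { this.term = term; }
    }

    // Global map: seqID -> DeleteAction.
    static final SortedMap<Long, DeleteAction> globalDeletes = new TreeMap<>();

    // One bit of each logged seqID marks the kind of action
    // (1 = delete, 0 = add/update), as suggested in the comment.
    static long encode(long seqID, boolean isDelete) {
        return (seqID << 1) | (isDelete ? 1 : 0);
    }

    // On flush: sort the DWPT's log by seqID, then a single pass counts
    // the adds preceding each delete, which is how many buffered docs
    // that delete affects in this DWPT.
    static Map<String, Integer> docsAffected(List<Long> log) {
        Collections.sort(log);
        Map<String, Integer> result = new LinkedHashMap<>();
        int addsSeen = 0;
        for (long entry : log) {
            long seqID = entry >>> 1;
            if ((entry & 1) == 1) {               // a delete
                DeleteAction action = globalDeletes.get(seqID);
                result.put(action.term, addsSeen);
                if (++action.flushCount == NUM_THREAD_STATES) {
                    globalDeletes.remove(seqID);  // applied by every DWPT
                }
            } else {
                addsSeen++;                       // an add/update by this DWPT
            }
        }
        return result;
    }
}
```

Because the log is sorted by seqID, the count of adds seen before a delete is exactly the "doc count to stop deleting at" for that DWPT, which is the question the quoted text raises.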
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850766#action_12850766 ] Michael Busch commented on LUCENE-2324: --- bq. I think for this same reason the ThreadBinder should have affinity Mike, can you explain what the advantages of this kind of thread affinity are? I've always wondered why the DocumentsWriter code currently makes an effort to assign a ThreadState to the same Thread every time. Is that done for performance reasons?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850760#action_12850760 ] Michael Busch commented on LUCENE-2324: --- Yes, we would need to buffer terms/queries per DW, and also per DW the BufferedDeletes.Num. The docID spaces in two DWs will be completely independent of each other after this change. One potential problem that we (I think) have even today is the following: if you index with multiple threads, and then call e.g. deleteDocuments(Term) from one of the indexer threads while you keep adding documents with the other threads, it's not clear to the caller when exactly the deleteDocuments(Term) will happen; it depends on the thread scheduling. Going back to the idea I mentioned here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841407&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841407 I mentioned the idea of a sequence ID that gets incremented on add, delete, and update. What if, even with separate DWs, we had a global sequence ID? The sequence ID would tell you unambiguously which action happened when. The add/update/delete methods could return the sequenceID that was assigned to that particular action. Then we could e.g. track the delete terms globally together with the sequenceID of the corresponding delete call, while still applying deletes during flush. Since sequenceIDs enforce a strict ordering, we can figure out how many docs per DW we need to apply the delete terms to. Later, when we switch to real-time deletes (when the RAM is searchable), we will simply store the sequenceIDs in the deletes int[] array which I mentioned in my comment on LUCENE-2293. Does this make sense?
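The global sequence ID idea is easy to picture as a shared atomic counter whose value is returned from each mutating call. A minimal sketch, under the assumption of one counter shared by all DWPTs (the class and method signatures below are hypothetical; the real IndexWriter methods return void):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one global, strictly ordered sequence ID shared by otherwise
// independent per-thread writers.
class SeqIdSketch {
    private final AtomicLong seq = new AtomicLong();

    long addDocument(String doc) {
        // ... all the (possibly slow) inversion work happens first ...
        // The seqID is assigned at the *end* of the call, right before
        // returning, so the ordering reflects completion, not arrival.
        return seq.incrementAndGet();
    }

    long deleteDocuments(String term) {
        // Deletes are global: every DWPT would record this seqID
        // in its own log.
        return seq.incrementAndGet();
    }
}
```

Since the counter is atomic, each action gets a unique, totally ordered ID, which is what lets a flush decide exactly how many buffered docs a given delete precedes.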
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850312#action_12850312 ] Michael Busch commented on LUCENE-2324: --- bq. Not all apps index only 140 character docs from all threads What a luxury! :) {quote} I think for this same reason the ThreadBinder should have affinity, ie, try to schedule the same thread to the same DW, assuming it's free. If it's not free and another DW is free you should use the other one. {quote} If you didn't have such an affinity but used a random assignment of DWs to threads, would that balance the RAM usage across DWs without a global RAM management?
[jira] Commented: (LUCENE-1879) Parallel incremental indexing
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850268#action_12850268 ] Michael Busch commented on LUCENE-1879: --- LUCENE-2324 will be helpful to support multi-threaded parallel-indexing. If we have single-threaded DocumentsWriters, then it should be easy to have a ParallelDocumentsWriter? > Parallel incremental indexing > - > > Key: LUCENE-1879 > URL: https://issues.apache.org/jira/browse/LUCENE-1879 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch > Fix For: 3.1 > > Attachments: parallel_incremental_indexing.tar > > > A new feature that allows building parallel indexes and keeping them in sync > on a docID level, independent of the choice of the MergePolicy/MergeScheduler. > Find details on the wiki page for this feature: > http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing > Discussion on java-dev: > http://markmail.org/thread/ql3oxzkob7aqf3jd
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850265#action_12850265 ] Michael Busch commented on LUCENE-2324: --- {quote} I'm not sure how we'd enforce the number of threads? Or we'd have to re-implement the wait system implemented in DW? {quote} I was thinking we were going to do that: have a fixed number of DocumentsWriterPerThread instances, and a ThreadBinder that lets a thread wait if no perthread is available. You don't need to interleave docIds then?
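The ThreadBinder wait scheme described above can be sketched with a blocking pool of writer instances. All names here (ThreadBinderSketch, DocumentsWriterPerThread's shape) are illustrative assumptions, not the actual Lucene classes:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: a fixed number of per-thread writers; an indexing thread
// blocks until one is free, so no docID interleaving is needed.
class ThreadBinderSketch {
    static class DocumentsWriterPerThread {
        int bufferedDocs = 0;                      // private docID space
        void addDocument(String doc) { bufferedDocs++; }
    }

    private final BlockingQueue<DocumentsWriterPerThread> free;

    ThreadBinderSketch(int numThreadStates) {
        free = new ArrayBlockingQueue<>(numThreadStates);
        for (int i = 0; i < numThreadStates; i++) {
            free.add(new DocumentsWriterPerThread());
        }
    }

    // take() waits when every DWPT is busy; put() hands it back.
    // Returns the DWPT's private doc count, just for illustration.
    int addDocument(String doc) throws InterruptedException {
        DocumentsWriterPerThread dwpt = free.take();
        try {
            dwpt.addDocument(doc);
            return dwpt.bufferedDocs;
        } finally {
            free.put(dwpt);
        }
    }
}
```

Because each DWPT owns a private docID space, a thread that waits for a free writer never needs a globally coordinated docID.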
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850262#action_12850262 ] Michael Busch commented on LUCENE-2324: --- {quote} But if 1 thread tends to index lots of biggish docs... don't we want to allow it to use up more than 1/nth? Ie we don't want to flush unless total RAM usage has hit the limit? {quote} Sure, that'd be the disadvantage. But is that a realistic scenario, that the "avg. document size per thread" differs significantly in an application?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850235#action_12850235 ] Michael Busch commented on LUCENE-2324: --- The easiest would be if each DocumentsWriterPerThread had a fixed buffer size, then they can flush fully independently and you don't need to manage RAM globally across threads. Of course then you'd need two config parameters: number of concurrent threads and buffer size per thread.
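The two-parameter configuration suggested above multiplies into a simple worst-case RAM bound, and each DWPT can make its flush decision locally. A tiny sketch with made-up parameter names (this is not the actual IndexWriterConfig API):

```java
// Sketch: fixed per-DWPT buffers mean worst-case RAM is just the
// product of the two knobs, and flushing needs no global coordination.
class FixedBufferSketch {
    final int maxThreadStates;        // concurrent DWPT instances (hypothetical name)
    final double ramPerThreadMB;      // private buffer per DWPT (hypothetical name)

    FixedBufferSketch(int maxThreadStates, double ramPerThreadMB) {
        this.maxThreadStates = maxThreadStates;
        this.ramPerThreadMB = ramPerThreadMB;
    }

    // Upper bound on total indexing RAM across all DWPTs.
    double worstCaseRamMB() {
        return maxThreadStates * ramPerThreadMB;
    }

    // Each DWPT checks only its own buffer; no other DWPT is paused.
    boolean shouldFlush(double usedMB) {
        return usedMB >= ramPerThreadMB;
    }
}
```

For example, 4 thread states at 16 MB each bound total usage at 64 MB, and a DWPT flushes the moment its own 16 MB buffer fills, regardless of what the others are doing.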
[jira] Created: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search
Explore other in-memory postinglist formats for realtime search --- Key: LUCENE-2346 URL: https://issues.apache.org/jira/browse/LUCENE-2346 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1
The current in-memory posting list format might not be optimal for searching. VInt decoding performance and the lack of skip lists would arguably be the biggest bottlenecks. For LUCENE-2312 we should investigate other formats. Some ideas:
- PFOR or packed ints for posting slices?
- Maybe even int[] slices instead of byte slices? This would be great for search performance, but the additional memory overhead might not be acceptable.
- For realtime search it's usually desirable to evaluate the most recent documents first. So using backward pointers instead of forward pointers, and having the postinglist pointer point to the most recent docID in a list, is something to consider.
- Skipping: if we use fixed-length postings ([packed] ints) we can do binary search within a slice. We can also locate a pointer then without scanning and thus skip entire slices quickly. Is that sufficient, or would we need more skipping layers, so that it's possible to skip directly to particular slices?
It would be awesome to find a format that doesn't slow down "normal" indexing, but is very efficient for in-memory searches. If we can't find such a fits-all format, we should have a separate indexing chain for real-time indexing.
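The skipping idea in the list above relies on fixed-length postings being binary-searchable within a slice. A minimal sketch of that one point (illustrative only; not the Lucene indexing chain, and the helper name is made up):

```java
import java.util.Arrays;

// Sketch: with fixed-length int[] postings, advancing to a target
// docID inside a slice is a binary search instead of a linear scan,
// so a whole slice can be skipped in O(log n).
class SliceSkipSketch {
    // Returns the index of the first docID >= target in a sorted slice,
    // or slice.length if every docID is smaller (slice exhausted).
    static int advance(int[] slice, int target) {
        int pos = Arrays.binarySearch(slice, target);
        // binarySearch returns -(insertionPoint) - 1 when not found.
        return pos >= 0 ? pos : -pos - 1;
    }
}
```

With variable-length VInt postings this is impossible, since a posting's byte offset can't be computed from its ordinal; that is exactly the trade-off the issue raises.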
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849899#action_12849899 ] Michael Busch commented on LUCENE-2324: --- Awesome!
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849819#action_12849819 ] Michael Busch commented on LUCENE-2324: --- Hey Jason, Disregard my patch here. I just experimented with removing the pooling, but then did LUCENE-2329 instead. TermsHash and TermsHashPerThread are now much simpler, because all the pooling code is gone after 2329 was committed. That should make it a little easier to get this patch done. Sure, it'd be awesome if you could provide a patch here. I can help you; we should just post patches here frequently so that we don't both work on the same areas.
[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2324: -- Attachment: (was: lucene-2324-no-pooling.patch)
[jira] Resolved: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch resolved LUCENE-2329. --- Resolution: Fixed Committed revision 926791.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848855#action_12848855 ] Michael Busch commented on LUCENE-2329: --- Cool, will do! Thanks for the review and good questions... and the whole idea! :)
[jira] Issue Comment Edited: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848827#action_12848827 ] Michael Busch edited comment on LUCENE-2329 at 3/23/10 6:06 PM:

{quote} They save the object header per-unique-term, and 4 bytes on 64bit JREs since the "pointer" is now an int and not a real pointer? {quote}

On 64-bit JVMs (which I used for my tests) we actually save 28 bytes per unique term:

h4. Trunk:
{code}
// Why + 4*POINTER_NUM_BYTE below?
// +1: Posting is referenced by postingsFreeList array
// +3: Posting is referenced by hash, which
//     targets 25-50% fill factor; approximate this
//     as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 4*DocumentsWriter.POINTER_NUM_BYTE;
...
@Override
int bytesPerPosting() {
  return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
}
...
abstract class RawPostingList {
  final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES
      + 3*DocumentsWriter.INT_NUM_BYTE;
...
// Coarse estimates used to measure RAM usage of buffered deletes
final static int OBJECT_HEADER_BYTES = 8;
final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
{code}
This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes.

h4. 2329:
{code}
// +3: Posting is referenced by hash, which
//     targets 25-50% fill factor; approximate this
//     as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 3*DocumentsWriter.INT_NUM_BYTE;
...
@Override
int bytesPerPosting() {
  return ParallelPostingsArray.BYTES_PER_POSTING + 4 * DocumentsWriter.INT_NUM_BYTE;
}
...
final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
{code}
This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.

I checked how many bytes were allocated for postings when the first segment was flushed. Trunk flushed after 6400 docs and had 103MB allocated for PostingList objects; 2329 flushed after 8279 docs and had 94MB allocated for the parallel arrays, of which 74MB were actually used. The first docs in the wikipedia dataset seem pretty large with many unique terms, so I think these numbers sound reasonable.
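The per-unique-term accounting above can be double-checked with a small standalone sketch. The constants below are redeclared locally with the 64-bit values quoted from DocumentsWriter; this is illustrative arithmetic, not the actual Lucene code:

```java
// Sketch verifying the per-unique-term RAM estimates quoted above.
// Constants mirror DocumentsWriter's coarse estimates on a 64-bit JVM.
class PostingRamEstimate {
    static final int OBJECT_HEADER_BYTES = 8; // coarse estimate
    static final int INT_NUM_BYTE = 4;
    static final int POINTER_NUM_BYTE = 8;    // 64-bit JVM

    // Trunk: RawPostingList (object header + 3 ints), plus 4 ints in the
    // consumer, plus 4 pointers (1 postingsFreeList reference + ~3 for the
    // 25-50% hash fill factor).
    static int trunkBytesPerTerm() {
        return (OBJECT_HEADER_BYTES + 3 * INT_NUM_BYTE)
                + 4 * INT_NUM_BYTE
                + 4 * POINTER_NUM_BYTE;
    }

    // LUCENE-2329: 3 ints per term in the parallel arrays, plus 4 ints in
    // the consumer, plus 3 ints for the hash; no object header, no pointers.
    static int parallelBytesPerTerm() {
        return 3 * INT_NUM_BYTE + 4 * INT_NUM_BYTE + 3 * INT_NUM_BYTE;
    }

    public static void main(String[] args) {
        System.out.println(trunkBytesPerTerm());                           // 68
        System.out.println(parallelBytesPerTerm());                        // 40
        System.out.println(trunkBytesPerTerm() - parallelBytesPerTerm());  // 28
    }
}
```

The 28-byte difference is exactly the eliminated object header (8 bytes), the free-list pointer (8 bytes), and the hash's three pointers shrinking to three ints (3 × 4 bytes).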
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848782#action_12848782 ] Michael Busch commented on LUCENE-2329: ---

{quote} OK, but, RAM used by TermVectors* shouldn't participate in the accounting... ie it only holds RAM for the one doc, at a time. {quote}

Man, my brain is lacking the TermVector synapses...

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in
> TermsHashPerField we want to switch to parallel arrays. The termsHash will
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in
> parallel arrays, where the termID is the index into the arrays. This will
> avoid the need for object pooling and will remove the overhead of object
> initialization and garbage collection. Especially garbage collection should
> benefit significantly when the JVM runs out of memory, because in such a
> situation the gc mark times can get very long if there is a big number of
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors. We could avoid
> the need of having to store the term string per document in the TermVector.
> Instead we could just store the segment-wide termIDs. This would reduce the
> size and also make it easier to implement efficient algorithms that use
> TermVectors, because no term mapping across documents in a segment would be
> necessary. We can make this improvement in a separate JIRA issue.
-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
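As a rough illustration of the parallel-array idea in the issue description above, here is a minimal sketch (hypothetical class and field names, not the patch's actual ParallelPostingsArray): each per-term attribute lives in its own int[] indexed by the dense termID, so there are a handful of long-living arrays instead of one object per unique term.

```java
// Minimal sketch of parallel arrays replacing per-term PostingList objects.
// Names are illustrative; real consumers track more per-term state.
class ParallelPostings {
    final int[] textStarts;  // offset of each term's text in a shared char pool
    final int[] freqs;       // occurrence count per term
    final int[] lastDocIDs;  // last docID that contained the term
    int size;                // number of terms registered so far

    ParallelPostings(int capacity) {
        textStarts = new int[capacity];
        freqs = new int[capacity];
        lastDocIDs = new int[capacity];
    }

    // Register a new term; its dense termID is just the next array slot.
    int addTerm(int textStart) {
        int termID = size++;
        textStarts[termID] = textStart;
        freqs[termID] = 0;
        lastDocIDs[termID] = -1;
        return termID;
    }

    // Record one occurrence of an existing term.
    void addOccurrence(int termID, int docID) {
        freqs[termID]++;
        lastDocIDs[termID] = docID;
    }
}
```

The GC benefit follows from the shape: the collector marks three array objects instead of millions of small PostingList instances.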
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848748#action_12848748 ] Michael Busch commented on LUCENE-2329: ---

{quote} so it's surprising the savings was so much that you get 22% fewer segments... are you sure there isn't a bug in the RAM usage accounting? {quote}

Yeah, it seems a bit suspicious; I'll investigate. But keep in mind that TermVectors were enabled too, and the number of "unique terms" in the 2nd TermsHash is higher, i.e. if you summed up numPostings from the 2nd TermsHash in each round, that sum should be higher than numPostings from the first TermsHash.
[jira] Issue Comment Edited: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848475#action_12848475 ] Michael Busch edited comment on LUCENE-2329 at 3/23/10 12:51 AM:

I did some performance experiments: I indexed 1M wikipedia documents using the cheap WhitespaceAnalyzer, no cfs files, all merging disabled, RAM buffer size = 200MB, a single writer thread, TermVectors enabled. Test machine: MacBook Pro, 2.53 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3, MacOS X 10.5.8.

h4. Results with -Xmx2000m:
|| || Write performance || Gain || Number of segments ||
| trunk | 833 docs/sec | | 41 |
| trunk + parallel arrays | 869 docs/sec | {color:green}+4.3%{color} | 32 |

h4. Results with -Xmx256m:
|| || Write performance || Gain || Number of segments ||
| trunk | 467 docs/sec | | 41 |
| trunk + parallel arrays | 871 docs/sec | {color:green}+86.5%{color} | 32 |

So I think these results are interesting and roughly as expected. 4.3% is a nice small performance gain. But running the tests with a low heap shows how much cheaper garbage collection becomes: setting IW's RAM buffer to 200MB and the overall heap to 256MB forces the gc to run frequently, and the mark times are much more costly if we keep all the long-living PostingList objects in memory compared to parallel arrays. So this is probably not a huge deal for "normal" indexing. But once we can search on the RAM buffer it becomes much more attractive to fill up the RAM as much as you can, and exactly in that case we save a lot with this improvement. Also note that the number of segments decreased by 22% (from 41 to 32). This shows that the parallel-array approach needs less RAM, thus flushes less often and will cause fewer segment merges in the long run. So a longer test with actual segment merges would show even bigger gains (with both big and small heaps). So overall, I'm very happy with these results!
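For reference, the gain percentages in the tables above follow directly from the reported docs/sec figures; this trivial check is not part of the benchmark itself:

```java
// Recomputes the relative throughput gains reported in the tables.
class GainCheck {
    static double gainPercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // -Xmx2000m: trunk 833 docs/sec vs parallel arrays 869 docs/sec
        System.out.printf("%.1f%%%n", gainPercent(833, 869));
        // -Xmx256m: trunk 467 docs/sec vs parallel arrays 871 docs/sec
        System.out.printf("%.1f%%%n", gainPercent(467, 871));
    }
}
```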
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848226#action_12848226 ] Michael Busch commented on LUCENE-2312: ---

I think sync'ing after every doc is probably the better option. We'll still avoid the need to make all variables downstream of DocumentsWriter volatile/atomic, which should be a nice performance gain. The problem with the delayed sync'ing (after e.g. 100 docs) is that if you don't have a never-ending stream of twee... err documents, then you might want to force an explicit sync at some point. But that's very hard, because you would have to force the writer thread to make e.g. a volatile write via an API call. And if that's an IndexWriter API that has to trigger the sync on multiple DocumentsWriter instances (i.e. multiple writer threads), I don't see how that's possible unless Lucene manages its own pool of threads.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max
> doc ids.
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848210#action_12848210 ] Michael Busch edited comment on LUCENE-2312 at 3/22/10 5:01 PM:

bq. So.. what does this mean for allowing an IR impl to directly search IW's RAM buffer?

The main idea is to have an approach that is lock-free; then write performance will not suffer no matter how big your query load is. When you open/reopen a RAMReader it would first ask the MemoryBarrier for the last sync'ed docID (volatile read). This would be the maxDoc for that reader, and it's safe for the reader to read up to that id, because it can be sure that all changes the writer thread made up to that maxDoc are visible to the reader. If we called MemoryBarrier.sync() let's say every 100 docs, then the max search latency would be the amount of time it takes to index 100 docs. Doing no volatile/atomic writes and not going through explicit locks for 100 docs will allow the JVM to do all its nice optimizations. I think this will work, but honestly I don't have a good feeling for how much performance this approach would gain compared to writing to volatile variables for every document.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848198#action_12848198 ] Michael Busch commented on LUCENE-2312: ---

Hi Brian - good to see you on this list! In my previous comment I actually quoted some sections of the concurrency book: https://issues.apache.org/jira/browse/LUCENE-2312?focusedCommentId=12845712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845712

Did I understand correctly that a volatile write can be used to enforce a cache->RAM write-through of *all* updates a thread made that came before the volatile write in the thread's program order? My idea here was to use this behavior to avoid volatile writes for every document, and instead to periodically do such a volatile write (say e.g. every 100 documents). I implemented a class called MemoryBarrier, which keeps track of when the last volatile write happened. A reader thread can ask the MemoryBarrier what the last successfully processed docID before crossing the barrier was. The reader will then never attempt to read beyond that document. Of course there are tons of details regarding safe publication of all involved fields and objects. I was just wondering if this general "memory barrier" approach seems right, and if indeed performance gains can be expected compared to doing volatile writes for every document?
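The MemoryBarrier idea discussed in the comments above can be sketched as follows. This is a minimal illustration assuming a single writer thread; the class shape, method names, and the interval of 100 docs are taken from the discussion but are not Lucene code. The writer publishes its progress with one volatile write every N documents; under the Java Memory Model that write acts as a release, so all plain writes the indexing thread made before it become visible to any reader thread that subsequently reads the volatile field.

```java
// Sketch of the MemoryBarrier discussed above: one volatile write per
// SYNC_INTERVAL docs instead of volatile writes on every field update.
class MemoryBarrier {
    static final int SYNC_INTERVAL = 100;

    // Volatile write = release: publishes all prior plain writes by the
    // single indexing thread to any thread that reads this field.
    private volatile int lastSyncedDocID = -1;
    private int docsSinceSync = 0; // touched only by the writer thread

    // Called by the single writer thread after each indexed document.
    void documentIndexed(int docID) {
        if (++docsSinceSync >= SYNC_INTERVAL) {
            docsSinceSync = 0;
            lastSyncedDocID = docID; // the periodic volatile write
        }
    }

    // Called by reader threads when (re)opening a RAM reader; readers
    // never read buffered postings beyond this docID.
    int maxSafeDocID() {
        return lastSyncedDocID; // volatile read = acquire
    }
}
```

The trade-off is exactly the one raised in the comments: readers lag the writer by up to SYNC_INTERVAL documents, and forcing an early sync requires poking the writer thread itself, since only a write by that thread establishes the happens-before edge.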
[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2329: -- Attachment: lucene-2329.patch Removed reset(). All tests still pass. > Use parallel arrays instead of PostingList objects > -- > > Key: LUCENE-2329 > URL: https://issues.apache.org/jira/browse/LUCENE-2329 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch > > > This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. > In order to avoid having very many long-lived PostingList objects in > TermsHashPerField we want to switch to parallel arrays. The termsHash will > simply be an int[] which maps each term to dense termIDs. > All data that the PostingList classes currently hold will then be placed in > parallel arrays, where the termID is the index into the arrays. This will > avoid the need for object pooling and will remove the overhead of object > initialization and garbage collection. Especially garbage collection should > benefit significantly when the JVM runs low on memory, because in such a > situation the gc mark times can get very long if there is a large number of > long-lived objects in memory. > Another benefit could be to build more efficient TermVectors. We could avoid > the need of having to store the term string per document in the TermVector. > Instead we could just store the segment-wide termIDs. This would reduce the > size and also make it easier to implement efficient algorithms that use > TermVectors, because no term mapping across documents in a segment would be > necessary. We can make this improvement in a separate JIRA issue, though.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848161#action_12848161 ] Michael Busch commented on LUCENE-2329: --- bq. I think *ParallelPostingsArray.reset do not need to zero-fill the arrays - these are overwritten when that termID is first used, right? Good point! I'll remove the reset() methods.
[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2329: -- Attachment: lucene-2329.patch Made the memory tracking changes as described in my previous comment. All tests still pass.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848058#action_12848058 ] Michael Busch commented on LUCENE-2329: --- One change I should make to the patch is how the memory consumption is tracked: bytesAllocated() should be called when the parallel arrays are allocated or grown, and bytesUsed() only when a new termID is added?
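The accounting split asked about in that comment could look roughly like this. A hedged sketch only: the class, the listener-style method names, and BYTES_PER_POSTING are illustrative, not Lucene's internals:

```java
// Sketch of the proposed split: allocation cost is charged when the
// parallel arrays grow; usage cost only when a new termID claims a slot.
// Names and the per-posting byte count are illustrative assumptions.
class PostingsArrayAccounting {
    static final int BYTES_PER_POSTING = 3 * 4; // e.g. three parallel int[] slots per term

    long bytesAllocated; // RAM reserved by the arrays, whether used or not
    long bytesUsed;      // RAM actually claimed by assigned termIDs
    private int capacity;

    PostingsArrayAccounting(int initialCapacity) {
        grow(initialCapacity);
    }

    // Called whenever the parallel arrays are (re)allocated or grown.
    final void grow(int newCapacity) {
        bytesAllocated += (long) (newCapacity - capacity) * BYTES_PER_POSTING;
        capacity = newCapacity;
    }

    // Called when a new termID is assigned a slot in the arrays.
    void addTerm() {
        bytesUsed += BYTES_PER_POSTING;
    }
}
```

The point of the split is that the flush trigger can watch `bytesAllocated` (the real RAM footprint) while `bytesUsed` reflects how densely that footprint is occupied.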
[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2329: -- Attachment: lucene-2329.patch Changes the indexer to use parallel arrays instead of PostingList objects (for both FreqProx* and TermVectors*). All core & contrib & bw tests pass. I haven't done performance tests yet. I'm wondering how to manage the size of the parallel array? I started with an initial size for the parallel array equal to the size of the postingsHash array. When it's full then I allocate a new one with 1.5x size. When shrinkHash() is called I also shrink the parallel array to the same size as postingsHash. How does that sound?
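The sizing policy proposed in that update (start at the postingsHash size, grow all arrays together by 1.5x when full, shrink alongside shrinkHash()) might be sketched like this. This is a simplified stand-in for the patch's ParallelPostingsArray, with illustrative field names:

```java
import java.util.Arrays;

// Simplified parallel-arrays posting storage: the dense termID indexes
// into each array. Starts at a given capacity, grows all arrays together
// by 1.5x when full, and can be shrunk back (e.g. alongside shrinkHash()).
class ParallelPostingsArray {
    int[] freqs;
    int[] lastPositions;
    int size; // number of assigned termIDs

    ParallelPostingsArray(int initialCapacity) {
        freqs = new int[initialCapacity];
        lastPositions = new int[initialCapacity];
    }

    int capacity() {
        return freqs.length;
    }

    // Assign the next dense termID, growing the arrays by 1.5x if needed.
    int addTerm(int freq, int lastPosition) {
        if (size == freqs.length) {
            int newCapacity = Math.max(size + 1, (int) (freqs.length * 1.5));
            freqs = Arrays.copyOf(freqs, newCapacity);
            lastPositions = Arrays.copyOf(lastPositions, newCapacity);
        }
        int termID = size++;
        freqs[termID] = freq;
        lastPositions[termID] = lastPosition;
        return termID;
    }

    // Shrink the arrays back down, mirroring shrinkHash() on the hash side.
    void shrink(int newCapacity) {
        if (newCapacity < freqs.length) {
            size = Math.min(size, newCapacity);
            freqs = Arrays.copyOf(freqs, newCapacity);
            lastPositions = Arrays.copyOf(lastPositions, newCapacity);
        }
    }
}
```

Because every per-term field lives at the same index, growing and shrinking must always resize all the arrays in lockstep; that is the main invariant this layout has to maintain.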
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847068#action_12847068 ] Michael Busch commented on LUCENE-2329: --- bq. Hmm the challenge is that the tracking done for term vectors is just within a single doc. Duh! Of course you're right.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847024#action_12847024 ] Michael Busch commented on LUCENE-2329: --- bq. This issue is just about how IndexWriter's RAM buffer stores its terms... Actually, when I talked about the TermVectors I meant we should explore storing the termIDs on *disk*, rather than the strings. It would help things like similarity search and facet counting. {quote} But, note that term vectors today do not store the term char[] again - they piggyback on the term char[] already stored for the postings. {quote} Yeah I think I'm familiar with that part (secondary entry point in TermsHashPerField, hashes based on termStart). I haven't looked much into how the "rest" of the TermVector in-memory data structures work. {quote} Though, I believe they store "int textStart" (increments by term length per unique term), which is less compact than the termID would be (increments +1 per unique term) {quote} Actually we wouldn't need a second hash table for the secondary TermsHash anymore, right? Like the primary TermsHash, it would just have a parallel array with the things that the TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID, of course. This would be a nice simplification, because no hash collisions, no hash table resizing based on load factor, etc. would be necessary for non-primary TermsHashes. bq. so if eg we someday use packed ints we'd be more RAM efficient by storing termIDs... How does the read performance of packed ints compare to "normal" int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance? Oh man, will we need flexible indexing for the in-memory index too? 
:)
[jira] Created: (LUCENE-2329) Use parallel arrays instead of PostingList objects
Use parallel arrays instead of PostingList objects -- Key: LUCENE-2329 URL: https://issues.apache.org/jira/browse/LUCENE-2329 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. In order to avoid having very many long-lived PostingList objects in TermsHashPerField we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to dense termIDs. All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling and will remove the overhead of object initialization and garbage collection. Especially garbage collection should benefit significantly when the JVM runs low on memory, because in such a situation the gc mark times can get very long if there is a large number of long-lived objects in memory. Another benefit could be to build more efficient TermVectors. We could avoid the need of having to store the term string per document in the TermVector. Instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. We can make this improvement in a separate JIRA issue, though.
[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2324: -- Attachment: lucene-2324-no-pooling.patch All tests pass, but I have to review whether the memory consumption calculation still works correctly with these changes. Not sure if the junit tests cover that? Also haven't done any performance testing yet. > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: lucene-2324-no-pooling.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846586#action_12846586 ] Michael Busch commented on LUCENE-2324: --- bq. Michael, Agreed, can you outline how you think we should proceed then? Sorry for not responding earlier... I'm currently working on removing the PostingList object pooling, because it makes TermsHash and TermsHashPerThread much simpler. I have written the patch and all tests pass, though I haven't done performance testing yet. Making TermsHash and TermsHashPerThread smaller will also make the patch here, which will remove them, easier. I'll post the patch soon. The next steps here, I think, are to make everything downstream of DocumentsWriter single-threaded (removal of the *PerThread classes). Then we need to write the DocumentsWriterThreadBinder and think about how to apply deletes, commits and rollbacks to all DocumentsWriter instances.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846128#action_12846128 ] Michael Busch commented on LUCENE-2324: --- I think we all agree that we want a single-writer-thread, multi-reader-thread model. Only then can the thread-safety problems in LUCENE-2312 be reduced to visibility (no write-locking). So I think making this change first makes the most sense. It involves a bit of boring refactoring work, unfortunately.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846084#action_12846084 ] Michael Busch commented on LUCENE-2324: --- Shall we not first try to remove the downstream *PerThread classes and make the DocumentsWriter single-threaded, without locking? Then we add a PerThreadDocumentsWriter and a DocumentsWriterThreadBinder, which talks to the PerThreadDWs, and IW talks to the DWTB. We can pick other names :) When that's done we can think about what kind of locking/synchronization/volatile stuff we need for LUCENE-2312.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845978#action_12845978 ] Michael Busch commented on LUCENE-2312: --- {quote} think we simply need a way to publish byte arrays to all threads? Michael B. can you post something of what you have so we can get an idea of how your system will work (ie, mainly what the assumptions are)? {quote} It's kinda complicated to explain and currently differs from Lucene's TermHash classes a lot. I'd prefer to wait a little bit until I have verified that my solution works. I think here we should really tackle LUCENE-2324 first - it's a prereq. Wanna help with that, Jason?
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845969#action_12845969 ] Michael Busch commented on LUCENE-2312: --- {quote} I thought we're moving away from byte block pooling and we're going to try relying on garbage collection? Does a volatile object[] publish changes to all threads? Probably not, again it'd just be the pointer. {quote} We were so far only considering moving away from pooling of (Raw)PostingList objects. Pooling byte blocks might have more performance impact - they're more heavy-weight.
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745 ] Michael Busch commented on LUCENE-2312: --- The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks). I think I have a lock-free solution working, which only syncs in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases. It's getting late now - need to sleep!
Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745 ] Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:51 AM: The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks). I think I have a lock-free solution working, which only syncs (i.e. does volatile writes) in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases. It's getting late now - need to sleep! was (Author: michaelbusch): The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks). I think I have a lock-free solution working, which only syncs in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases. It's getting late now - need to sleep! > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731 ] Michael Busch commented on LUCENE-2312: --- {quote} Do volatile byte arrays work {quote} I'm not sure what you mean by volatile byte arrays? > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731 ] Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:12 AM: {quote} Do volatile byte arrays work {quote} I'm not sure what you mean by volatile byte arrays? Do you mean this? {code} volatile byte[] array; {code} This makes the *reference* to the array volatile, not the slots in the array. was (Author: michaelbusch): {quote} Do volatile byte arrays work {quote} I'm not sure what you mean by volatile byte arrays? > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
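The point about `volatile byte[]` can be made concrete with a short sketch (illustrative, not Lucene code): the `volatile` keyword attaches to the array *reference*, so reassigning the field is published to other threads, but storing into a slot is an ordinary write with no visibility guarantee. Since Java 5, the `java.util.concurrent.atomic` Atomic*Array classes give volatile-style semantics per element (at int granularity in this example).

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch (hypothetical names): contrast a volatile array reference with
// per-slot volatile semantics via AtomicIntegerArray.
public class SlotVisibility {
    // 'volatile' applies to the reference only: reassigning 'buffer' is
    // safely published, but buffer[i] = x is a plain, unpublished write.
    static volatile byte[] buffer = new byte[16];

    // set()/get() behave like volatile writes/reads on each slot.
    static final AtomicIntegerArray slots = new AtomicIntegerArray(16);

    public static void main(String[] args) {
        buffer[0] = 42;            // plain write: no publication guarantee
        slots.set(0, 42);          // volatile-style write: visible to readers
        System.out.println(slots.get(0)); // prints 42
    }
}
```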
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845726#action_12845726 ] Michael Busch commented on LUCENE-2312: --- {quote} A quick and easy way to solve this is to use a read write lock on the byte pool? {quote} If you use a RW lock, the writer thread will block all reader threads while it's making changes - and in a real-time search environment the writer thread will be making changes all the time. I'm sure the contention would kill performance. A RW lock is only faster than a mutual-exclusion lock if writes are infrequent, as the javadocs of ReadWriteLock point out. > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. 
Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
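The RW-lock design being argued against above can be sketched like this (an illustrative example with made-up names, not actual Lucene code): one ReadWriteLock guards the in-memory byte pool, so every indexed document takes the write lock and excludes all searchers for the duration of the write. Since writes are constant in a real-time indexing workload, readers would be blocked almost continuously, which is exactly the contention concern raised in the comment.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the rejected design (hypothetical class and field names):
// a single RW lock guarding the in-memory byte pool. ReadWriteLock only
// pays off when writes are rare; here the indexing thread writes all the
// time, so readers are serialized behind it.
public class LockedBytePool {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final byte[] pool = new byte[1 << 16];
    private int used = 0;

    // Indexing thread: called constantly, excludes ALL readers while held.
    public void append(byte b) {
        lock.writeLock().lock();
        try { pool[used++] = b; }
        finally { lock.writeLock().unlock(); }
    }

    // Search threads: block whenever the writer holds the write lock.
    public byte read(int offset) {
        lock.readLock().lock();
        try { return pool[offset]; }
        finally { lock.readLock().unlock(); }
    }

    public int size() {
        lock.readLock().lock();
        try { return used; }
        finally { lock.readLock().unlock(); }
    }
}
```

The code is thread-safe, which is why it looks tempting; the objection is purely about throughput under a write-heavy load.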
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845712#action_12845712 ] Michael Busch commented on LUCENE-2312: --- {quote} Hmm... what does JMM say about byte arrays? If one thread is writing to the byte array, can any other thread see those changes? {quote} This is exactly the right question to ask here. Thread-safety is by far the most complicated aspect of this feature. Jason, I'm not sure if you already figured out how to ensure visibility of changes made by the writer thread to the reader threads? Thread-safety in our case boils down to safe publication. We don't need locking to coordinate writing of multiple threads, because of LUCENE-2324. But we need to make sure that the reader threads see all changes they need to see at the right time, in the right order. This is IMO very hard, but we all like challenges :) The JMM gives no guarantee whatsoever what changes a thread will see that another thread made - or if it will ever see the changes, unless proper publication is ensured by either synchronization or volatile/atomic variables. So e.g. if a writer thread executes the following statements: {code} public static int a, b; ... a = 1; b = 2; a = 5; b = 6; {code} and a reader thread does: {code} System.out.println(a + "," + b); {code} The thing to remember is that the output might be: 1,6! Another reader thread with the following code: {code} while (b != 6) { .. do something } {code} might even NEVER terminate without synchronization/volatile/atomic. The reason is that the JVM is allowed to perform any reorderings to utilize modern CPUs, memory, caches, etc. if not forced otherwise. To ensure safe publication of data written by a thread we could do synchronization, but my goal here is to implement a non-blocking and lock-free algorithm. So my idea was to make use of a very subtle behavior of volatile variables. 
I will take a simple explanation of the JMM from Brian Goetz' awesome book "Java Concurrency in Practice", in which he describes the JMM in simple happens-before rules. I will mention only three of those rules, because they are enough to describe the volatile behavior I'd like to mention here (p. 341): *Program order rule:* Each action in a thread _happens-before_ every action in that thread that comes later in the program order. *Volatile variable rule:* A write to a volatile field _happens-before_ every subsequent read of that same field. *Transitivity:* If A _happens-before_ B, and B _happens-before_ C, then A _happens-before_ C. Based on these three rules you can see that writing to a volatile variable v by one thread t1 and subsequent reading of the same volatile variable v by another thread t2 publishes ALL changes of t1 that happened-before the write to v, plus the change of v itself. So this write/read of v means crossing a memory barrier and forcing everything that t1 might have written to caches to be flushed to RAM. That's why a volatile write can actually be pretty expensive. Note that this behavior has only worked as I just described since Java 1.5 - the change in volatile semantics from 1.4 to 1.5 was very subtle! The way I'm trying to make use of this behavior is actually similar to how we lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM write-through as much as possible, which increases the probability of getting the sync for free! Still fleshing out the details, but I wanted to share this info with you guys already, because it might invalidate a lot of assumptions you might have when developing the code. Some of this stuff was actually new to me, maybe you all know it already. And if anything that I wrote here is incorrect, please let me know! Btw: IMO, if there's only one Java book you can ever read, then read Goetz' book! It's great. 
He also says in the book somewhere about lock-free algorithms: "Don't try this at home!" - so, let's do it! :) > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle
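The happens-before publication trick described above can be sketched as follows (an illustrative single-writer example with made-up names, not Lucene's actual classes): the writer fills a plain array with ordinary writes and only then advances a volatile watermark; a reader that loads the watermark first is guaranteed, by the volatile variable rule plus transitivity, to see every slot below that mark.

```java
// Sketch of safe publication via one volatile field (hypothetical names).
// Assumption: a SINGLE writer thread, as in the per-thread-writer model of
// LUCENE-2324; multiple writers would need additional coordination.
public class PublishedBuffer {
    private final int[] postings = new int[1024]; // plain, non-volatile data
    private volatile int published = 0;           // the memory barrier

    // Writer thread: plain writes, then one volatile write publishes them.
    public void add(int posting) {
        int p = published;
        postings[p] = posting;   // ordinary write, not yet visible
        published = p + 1;       // volatile write: everything above is now
                                 // visible to any thread that reads 'published'
    }

    // Reader thread: volatile read first, then plain reads below the mark.
    public int get(int i) {
        int limit = published;   // volatile read crosses the barrier
        if (i >= limit) throw new IndexOutOfBoundsException();
        return postings[i];      // safe: happens-before via 'published'
    }

    public int size() { return published; }
}
```

A reader may see a slightly stale `published` and miss the newest postings, but it can never observe a posting slot that was not fully written - which is the "consistent snapshot" property the comments above are after.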
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845703#action_12845703 ] Michael Busch commented on LUCENE-2312: --- {quote} Sounds like awesome progress!! Want some details over here :) {quote} Sorry for not being very specific. The prototype I'm experimenting with has a fixed-length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The one remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! :) If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Then again, the same term rarely occurs multiple times in super short docs. The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the posting lists in reverse order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice. > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 3.0.1 >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: 3.1 > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. 
This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
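The fixed-width posting layout described above (3 bytes of absolute docID plus 1 byte of position per int) can be sketched like this. The class and method names are made up for illustration; the bit layout - high 24 bits for the docID, capping a RAM segment at 2^24 docs, low 8 bits for the position, which fits because docs are at most 140 characters - follows the comment.

```java
// Sketch of a 4-byte fixed-width posting (hypothetical names): one int per
// posting, so postings can live in int[] slices instead of byte[] pools.
public class PackedPosting {
    static final int MAX_DOC = (1 << 24) - 1; // 3 bytes of absolute docID
    static final int MAX_POS = 0xFF;          // 1 byte of position

    static int pack(int docId, int position) {
        if (docId < 0 || docId > MAX_DOC || position < 0 || position > MAX_POS)
            throw new IllegalArgumentException("out of range");
        return (docId << 8) | position;       // [docID:24][position:8]
    }

    // Unsigned shift so the top bit of a full docID doesn't sign-extend.
    static int docId(int posting)    { return posting >>> 8; }
    static int position(int posting) { return posting & 0xFF; }
}
```

A TermDocs over such a slice can binary-search on the packed ints directly, since postings with larger docIDs compare larger once the position byte is masked off.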
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845400#action_12845400 ] Michael Busch commented on LUCENE-2324: --- {quote} Sounds great - let's test it in practice. {quote} I have to admit that I need to catch up a bit on the flex branch. I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code? Is it as fast as trunk already, or are there related nocommits left that affect indexing performance? I would think not much of the flex changes should affect the in-memory indexing performance (in TermsHash*). > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398 ] Michael Busch edited comment on LUCENE-2324 at 3/15/10 4:34 PM: Reply to Mike's comment on LUCENE-2293: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263 {quote} I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed). EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting. {quote} Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once. {quote} I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...). And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access. {quote} Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs). 
Though the price you pay is an extra dereference for each array access? And how much RAM would we save? The pointer for the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java? Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it? was (Author: michaelbusch): {quote} I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed). EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting. {quote} Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once. {quote} I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...). And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access. {quote} Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. 
Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs). Though, the price you pay is for dereferencing multiple times for each array? And how much RAM would we safe? The pointer for the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java? Seems ilke it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html So in a 32Bit JVM we would safe 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398 ] Michael Busch commented on LUCENE-2324: --- {quote} I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed). EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting. {quote} Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once. {quote} I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...). And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access. {quote} Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs). Though the price you pay is an extra dereference for each array access? And how much RAM would we save? The pointer for the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java? 
Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
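The parallel-arrays idea discussed in this thread can be sketched as follows (an illustrative toy with hypothetical field names, not the FreqProxTermsWriter internals): each unique term gets a dense int ID instead of its own PostingList object, and per-term attributes like docFreq and lastDocID live in flat arrays indexed by that ID, avoiding the per-object pointer and header overhead computed above.

```java
import java.util.Arrays;

// Sketch of dense term IDs + parallel per-attribute arrays (hypothetical
// names). One int per attribute per term; no per-term object allocation.
public class ParallelPostingArrays {
    private int[] docFreqs;    // docFreqs[termID]   = #docs containing term
    private int[] lastDocIDs;  // lastDocIDs[termID] = last doc seen for term
    private int nextTermID = 0;

    public ParallelPostingArrays(int initialCapacity) {
        docFreqs = new int[initialCapacity];
        lastDocIDs = new int[initialCapacity];
    }

    // First occurrence of a term: hand out the next dense ID (0, 1, 2...).
    public int newTerm(int docID) {
        if (nextTermID == docFreqs.length) grow();
        int id = nextTermID++;
        docFreqs[id] = 1;
        lastDocIDs[id] = docID;
        return id;
    }

    // Subsequent occurrence: bump docFreq only when the docID changes.
    public void addOccurrence(int termID, int docID) {
        if (lastDocIDs[termID] != docID) {
            docFreqs[termID]++;
            lastDocIDs[termID] = docID;
        }
    }

    public int docFreq(int termID)   { return docFreqs[termID]; }
    public int lastDocID(int termID) { return lastDocIDs[termID]; }

    // Doubling growth; a paged layout would avoid the copy at the cost of
    // an extra indirection per access, as noted in the comment above.
    private void grow() {
        docFreqs = Arrays.copyOf(docFreqs, docFreqs.length * 2);
        lastDocIDs = Arrays.copyOf(lastDocIDs, lastDocIDs.length * 2);
    }
}
```

Adding another per-term attribute is just one more parallel array, which is the "no RAM overhead per data type" property mentioned in the comment.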
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845391#action_12845391 ] Michael Busch commented on LUCENE-2293: --- I'll reply on LUCENE-2324. > IndexWriter has hard limit on max concurrency > - > > Key: LUCENE-2293 > URL: https://issues.apache.org/jira/browse/LUCENE-2293 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2293.patch > > > DocumentsWriter has this nasty hardwired constant: > {code} > private final static int MAX_THREAD_STATE = 5; > {code} > which probably I should have attached a //nocommit to the moment I > wrote it ;) > That constant sets the max number of thread states to 5. This means, > if more than 5 threads enter IndexWriter at once, they will "share" > only 5 thread states, meaning we gate CPU concurrency to 5 running > threads inside IW (each thread must first wait for the last thread to > finish using the thread state before grabbing it). > This is bad because modern hardware can make use of more than 5 > threads. So I think an immediate fix is to make this settable > (expert), and increase the default (8?). > It's tricky, though, because the more thread states, the less RAM > efficiency you have, meaning the worse indexing throughput. So you > shouldn't up and set this to 50: you'll be flushing too often. > But... I think a better fix is to re-think how threads write state > into DocumentsWriter. Today, a single docID stream is assigned across > threads (eg one thread gets docID=0, next one docID=1, etc.), and each > thread writes to a private RAM buffer (living in the thread state), > and then on flush we do a merge sort. The merge sort is inefficient > (does not currently use a PQ)... and, wasteful because we must > re-decode every posting byte. 
> I think we could change this, so that threads write to private RAM > buffers, with a private docID stream, but then instead of merging on > flush, we directly flush each thread as its own segment (and, allocate > private docIDs to each thread). We can then leave merging to CMS > which can already run merges in the BG without blocking ongoing > indexing (unlike the merge we do in flush, today). > This would also allow us to separately flush thread states. Ie, we > need not flush all thread states at once -- we can flush one when it > gets too big, and then let the others keep running. This should be a > good concurrency gain since is uses IO & CPU resources "throughout" > indexing instead of "big burst of CPU only" then "big burst of IO > only" that we have today (flush today "stops the world"). > One downside I can think of is... docIDs would now be "less > monotonic", meaning if N threads are indexing, you'll roughly get > in-time-order assignment of docIDs. But with this change, all of one > thread state would get 0..N docIDs, the next thread state'd get > N+1...M docIDs, etc. However, a single thread would still get > monotonic assignment of docIDs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
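The private-docID-stream idea in Mike's proposal can be sketched like this (a toy with made-up names, not the actual DocumentsWriter code): each thread state numbers its docs 0..N-1 privately; when a buffer is flushed as its own segment, the segment is assigned a global docID base, so a doc's index-wide ID is base + privateID - monotonic within one thread's segment, but not across threads, which is exactly the "less monotonic" downside noted above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names) of per-thread private docID streams that are
// remapped to global docIDs at flush time by assigning each flushed segment
// a contiguous docID base.
public class PrivateDocIDs {
    static class FlushedSegment {
        final int docBase;   // first global docID of this segment
        final int docCount;  // docs flushed from the private stream
        FlushedSegment(int docBase, int docCount) {
            this.docBase = docBase;
            this.docCount = docCount;
        }
    }

    private final List<FlushedSegment> segments = new ArrayList<>();
    private int nextBase = 0;

    // A thread state flushes 'docCount' privately-numbered docs as one
    // segment; only this base assignment needs cross-thread coordination.
    public synchronized FlushedSegment flush(int docCount) {
        FlushedSegment seg = new FlushedSegment(nextBase, docCount);
        nextBase += docCount;
        segments.add(seg);
        return seg;
    }

    public static int globalDocID(FlushedSegment seg, int privateDocID) {
        return seg.docBase + privateDocID;
    }
}
```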
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845199#action_12845199 ] Michael Busch commented on LUCENE-2324: --- Here is an interesting article about allocation/deallocation on modern JVMs: http://www.ibm.com/developerworks/java/library/j-jtp09275.html And here is a snippet that mentions how pooling is generally not faster anymore: Allocation in JVMs was not always so fast -- early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice -- because it was, along with everything else in early JVMs -- and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.) However, a lot has happened since the JDK 1.0 days; the introduction of generational collectors in JDK 1.2 has enabled a much simpler approach to allocation, greatly improving performance. > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. 
> This should be a good simplification in the chain (eg maybe we can remove
> the *PerThread classes). The segments can flush independently, letting us
> make much better concurrent use of IO & CPU.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845190#action_12845190 ] Michael Busch commented on LUCENE-2293:
---
OK I opened LUCENE-2324. We can close this one once you've committed your patch, Mike.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5. This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads. So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput. So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter. Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort. The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread). We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states. Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running. This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs. But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc. However, a single thread would still get
> monotonic assignment of docIDs.
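The per-thread design sketched above can be illustrated with a toy model (all class and method names here are hypothetical, not actual Lucene code): each writer owns a private docID stream and a private buffer, and flushes on its own when it crosses its limit, without pausing the others.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a per-thread DocumentsWriter: private docIDs, private buffer,
// independent flush. Names are illustrative only, not real Lucene classes.
class ToyDocWriter {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBufferedDocs;
    private int nextDocId = 0;          // private docID stream, starts at 0
    int flushedSegments = 0;

    ToyDocWriter(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }

    /** Adds a doc and returns its thread-private docID; may trigger a private flush. */
    int addDocument(String doc) {
        int id = nextDocId++;
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();                    // flushes only this writer's docs as one segment
        }
        return id;
    }

    void flush() {
        if (!buffer.isEmpty()) {
            flushedSegments++;          // in Lucene this would write a real segment
            buffer.clear();
        }
    }
}
```

Note how two such writers never coordinate: each one's docIDs start at 0 and one can flush while the other keeps indexing, which is the concurrency gain described in the comment.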
[jira] Created: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
Per thread DocumentsWriters that write their own private segments
- Key: LUCENE-2324
URL: https://issues.apache.org/jira/browse/LUCENE-2324
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.1

See LUCENE-2293 for motivation and more details.
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845159#action_12845159 ] Michael Busch commented on LUCENE-2293:
---
bq. How about a new issue?

OK, will open one.

bq. (if Zipf's law is applying, half the terms should be singletons; if it's not, you could have many more singleton terms...)

Yeah, we should utilize our knowledge of term distribution to optimize the in-memory postings. For example, a nice optimization currently would be to store the first posting directly in the PostingList object and only allocate slices once you see the second occurrence (similar to the pulsing codec)?

bq. Though... to reduce our per-unique-term RAM cost, we may want to move away from separate postings object per term to parallel arrays.

What exactly do you mean by parallel arrays? Parallel to the termsHash array? Then the termsHash array would no longer be an array of PostingList objects, but an array of pointers into the char[] array? And you'd have e.g. a parallel int[] array for df, another int[] for pointers into the postings byte pool, etc.? Something like that?
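The "parallel arrays" question above can be made concrete with a toy sketch (field names are illustrative, not the layout Lucene actually adopted): one slot per unique term, with per-term data held in parallel primitive arrays instead of a small object per term, which removes per-object header overhead and GC pressure.

```java
// Toy sketch of the "parallel arrays" idea: one slot per unique term, with
// per-term data held in parallel primitive arrays instead of one small
// PostingList object per term. Field names are illustrative.
class ToyTermsHash {
    int[] textStart;        // pointer into a shared char[] pool
    int[] docFreq;          // df per term
    int[] postingsStart;    // pointer into the postings byte pool
    int numTerms = 0;

    ToyTermsHash(int capacity) {
        textStart = new int[capacity];
        docFreq = new int[capacity];
        postingsStart = new int[capacity];
    }

    /** Registers a new unique term; returns the term's slot. */
    int add(int textPtr, int postingsPtr) {
        int slot = numTerms++;
        textStart[slot] = textPtr;
        postingsStart[slot] = postingsPtr;
        docFreq[slot] = 1;
        return slot;
    }

    /** Called for each further occurrence of an already-seen term. */
    void incrementFreq(int slot) { docFreq[slot]++; }
}
```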
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845157#action_12845157 ] Michael Busch commented on LUCENE-2312:
---
{quote}
Michael are you also going to [first] tackle truly separating the RAM segments? I think we need this first ...
{quote}

Yeah, I agree. I started working on a patch for separating the doc writers already. I also have a separate indexing-chain prototype working with a searchable RAM buffer (single-threaded), but with a slightly different posting-list format (some docs nowadays only have 140 characters ;) ). It seems really fast. I spent a long time thinking about lock-free algorithms and data structures, so indexing performance should be completely independent of the search load (in theory). I need to think a bit more about how to make it work with "normal" documents and Lucene's current in-memory format.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max
> doc ids.
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
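The "search should be independent of indexing" idea above can be sketched with a toy single-writer buffer (names are illustrative, not the prototype's actual code): the writer publishes a volatile maxDoc after each doc, and a reader snapshots it once, so later additions never perturb an open reader.

```java
// Toy sketch of reader/writer decoupling for a searchable RAM buffer: the
// single writer publishes a volatile maxDoc after each added doc, and a
// reader takes a point-in-time snapshot and ignores anything added later.
// Names are illustrative, not the actual prototype's code.
class ToyRamBuffer {
    private volatile int maxDoc = 0;   // published after each added doc

    void addDocument() {
        // ... index the doc into the buffer first, then publish its visibility:
        maxDoc = maxDoc + 1;           // single writer, so no CAS is needed
    }

    /** A reader snapshots maxDoc once; its view never changes afterwards. */
    int openReaderSnapshot() { return maxDoc; }

    /** A doc is visible to a reader iff it existed at snapshot time. */
    boolean isVisible(int docId, int snapshot) { return docId < snapshot; }
}
```

The design point is the write-then-publish order: because the doc's postings are written before the volatile store, any reader that sees the new maxDoc also sees the completed postings, with no locks on the search path.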
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845155#action_12845155 ] Michael Busch commented on LUCENE-2312:
---
Well, we need to keep our transactional semantics. So I assume that, while a flush will happen per doc writer independently, a commit will trigger all (per-thread) doc writers to flush. A rollback then also has to abort all per-thread doc writers.
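The semantics described above can be sketched in a few lines (all names hypothetical): flush is a per-writer operation, but commit and rollback iterate over every per-thread writer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the transactional semantics described above: flush is
// per-writer, but commit flushes every per-thread writer and rollback
// aborts every one. All names are illustrative, not real Lucene classes.
class ToyIndexWriter {
    static class Dwpt {
        int buffered = 0;   // docs waiting in this writer's RAM buffer
        int flushed = 0;    // docs already written out as segments
        void flush() { flushed += buffered; buffered = 0; }
        void abort() { buffered = 0; }
    }

    final List<Dwpt> writers = new ArrayList<>();

    void commit() {
        for (Dwpt w : writers) w.flush();   // every per-thread writer flushes
    }

    void rollback() {
        for (Dwpt w : writers) w.abort();   // all buffered docs are discarded
    }
}
```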
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845048#action_12845048 ] Michael Busch commented on LUCENE-2293:
---
I'm tempted to get rid of the pooling for PostingList objects. The objects are very small, and since 1.5 Java does a good job with object creation and GC. I have even read that the JVM developers think pooling can be slower than not pooling. Also, the GC performance problems I have seen so far mostly involved a large number of long-living objects, which makes the mark phase of garbage collection very long - and pooling gets you into exactly that situation. So what do you think about removing the pooling of the PostingList objects?
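The simplification argued for above - drop the free list and just allocate - can be sketched as follows (class and field names are illustrative, not Lucene's real in-memory postings): a fresh, short-lived object per unique term, which is exactly the allocation pattern generational collectors handle cheaply.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of dropping the pool: instead of recycling PostingList-like
// objects through a free list, allocate a fresh one per unique term and let
// the generational collector reclaim it. Names are illustrative only.
class ToyPostingList {
    int docFreq;
    int lastDocId;
    ToyPostingList(int firstDocId) { this.docFreq = 1; this.lastDocId = firstDocId; }
}

class ToyTermsDict {
    final Map<String, ToyPostingList> terms = new HashMap<>();

    void addOccurrence(String term, int docId) {
        ToyPostingList p = terms.get(term);
        if (p == null) {
            terms.put(term, new ToyPostingList(docId)); // fresh allocation, no pool
        } else {
            p.docFreq++;
            p.lastDocId = docId;
        }
    }
}
```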
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845047#action_12845047 ] Michael Busch commented on LUCENE-2293:
---
{quote}
but does anyone out there wanna work out the "private RAM segments"?
{quote}

Shall we use this issue for the private RAM segments? Or do you want to commit the simple patch, close this one and open a new issue?
[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2312:
--
Fix Version/s: (was: 3.0.2) 3.1
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845032#action_12845032 ] Michael Busch commented on LUCENE-2312:
---
I'll try to tackle this one!
[jira] Assigned: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch reassigned LUCENE-2312:
--
Assignee: Michael Busch
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845031#action_12845031 ] Michael Busch commented on LUCENE-2312:
---
{quote}
Also, we could store the first docID stored into the term, too - this way we could have a ordered collection of terms, that's shared across several open readers even as changes are still being made, but each reader skips a given term if its first docID is greater than the maxDoc it's searching. That'd give us point in time searching even while we add terms with time...
{quote}

Exactly. This is what I meant in my comment: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

But I mistakenly said lastDocID; of course firstDocID is correct.
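The firstDocID idea quoted above can be sketched as a toy append-only term list (names are illustrative): every term records the first docID that used it, and a reader opened at a given maxDoc simply skips terms created after its snapshot, giving point-in-time search over a shared, growing collection.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of point-in-time term visibility: each term records the first
// docID that used it, and a reader opened at a given maxDoc skips terms
// whose firstDocID lies beyond its snapshot. Names are illustrative only.
class ToyTermList {
    static class Term {
        final String text;
        final int firstDocId;
        Term(String text, int firstDocId) { this.text = text; this.firstDocId = firstDocId; }
    }

    final List<Term> terms = new ArrayList<>();  // shared, append-only

    void addTerm(String text, int firstDocId) { terms.add(new Term(text, firstDocId)); }

    /** Terms visible to a reader whose snapshot covers docIDs < maxDoc. */
    List<String> visibleTerms(int maxDoc) {
        List<String> out = new ArrayList<>();
        for (Term t : terms) {
            if (t.firstDocId < maxDoc) out.add(t.text);
        }
        return out;
    }
}
```

Because the list is append-only and the filter is purely by firstDocID, many readers with different snapshots can share the same structure while the writer keeps adding terms.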
[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842502#action_12842502 ] Michael Busch commented on LUCENE-2302:
---
Hmm, maybe this is too much magic? Wouldn't it be simpler to have two completely separate attributes, e.g. CharTermAttribute and ByteTermAttribute, plus an API in the indexer that specifies which one to use?

> Replacement for TermAttribute+Impl with extended capabilities (byte[]
> support, CharSequence, Appendable)
>
> Key: LUCENE-2302
> URL: https://issues.apache.org/jira/browse/LUCENE-2302
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: Flex Branch
> Reporter: Uwe Schindler
> Fix For: Flex Branch
>
> For flexible indexing, terms can be simple byte[] arrays, while the current
> TermAttribute only supports char[]. This is fine for plain text, but e.g.
> NumericTokenStream should work directly on the byte[] array.
> Also, TermAttribute lacks some interfaces that would make it simpler for
> users to work with: Appendable and CharSequence.
> I propose to create a new interface "CharTermAttribute" with a clean new API
> that concentrates on CharSequence and Appendable.
> The implementation class will simply support the old and new interface
> working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of
> this. So if somebody adds a TermAttribute, he will get an implementation
> class that can also be used as a CharTermAttribute. As both attributes create
> the same impl instance, both calls to addAttribute are equal. So a TokenFilter
> that adds CharTermAttribute to the source will work with the same instance as
> the Tokenizer that requested the (deprecated) TermAttribute.
> To also support byte[]-only terms, as Collation or NumericField need, a
> separate getter-only interface will be added that returns a reusable
> BytesRef, e.g. BytesRefGetterAttribute.
The default implementation class will > also support this interface. For backwards compatibility with old > self-made-TermAttribute implementations, the indexer will check with > hasAttribute(), if the BytesRef getter interface is there and if not will > wrap a old-style TermAttribute (a deprecated wrapper class will be provided): > new BytesRefGetterAttributeWrapper(TermAttribute), that is used by the > indexer then. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
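To make the "two completely separate attributes" suggestion above concrete, here is a minimal sketch of what such a split could look like. All names (CharTermAttr, ByteTermAttr) and method signatures are hypothetical illustrations, not the API that actually shipped in Lucene:

```java
// Hypothetical sketch of the two-attribute split suggested in the comment:
// one attribute type for char[]-based terms, a separate one for byte[]-based
// terms, with the indexer told explicitly which one a TokenStream produces.
class CharTermAttr {
    private char[] buf = new char[16];
    private int len;

    // copy the term characters into the reusable buffer
    void setTerm(CharSequence s) {
        len = s.length();
        if (buf.length < len) buf = new char[len];
        for (int i = 0; i < len; i++) buf[i] = s.charAt(i);
    }

    String term() { return new String(buf, 0, len); }
}

class ByteTermAttr {
    private byte[] buf = new byte[0];

    // e.g. what a NumericTokenStream would fill in directly
    void setTermBytes(byte[] b) { buf = b.clone(); }

    byte[] termBytes() { return buf; }
}
```

The trade-off the comment hints at: with two unrelated attributes the indexer needs an explicit switch, whereas the single-impl approach in the issue description lets both views share one term buffer.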
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841923#action_12841923 ] Michael Busch commented on LUCENE-2293:
---
bq. So about the int[], would that be of the size of the index (flushed and unflushed) segments? Suppose that:

Each DW would have its own int[]. The size would correspond to the number of docs the DW has in its buffer.

{quote}
I've indexed 5 documents, flushed. (IDs 0-4)
Indexed 2 on DW1. (IDs 0,1)
Indexed 2 on DW2. (IDs 0,1)
Delete by term which affects: flushed IDs 1, 4, DW1 - 0, DW2 - 0, 1
Would the int[] be of size 9, and the deleted IDs be 1, 4, 5, 7, 8? How would DW1-0 be mapped to 5, and DW2-0,1 be mapped to 7 and 8? Will the int[] initially be of size 5, and after DW1 flushes expand to 7 with ID=5 set (and afterwards expand to 9 with IDs 7,8)? If so then I understand.
{quote}

DW1 will have an int[] of size 2, and DW2 will also have a separate int[] of size 2. I think you were thinking of one big int[] across the entire index? The whole approach should make sense once you think of the int[]s as per-RAM-segment.

> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 3.1
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I wrote it ;)
> That constant sets the max number of thread states to 5. This means, if more than 5 threads enter IndexWriter at once, they will "share" only 5 thread states, meaning we gate CPU concurrency to 5 running threads inside IW (each thread must first wait for the last thread to finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5 threads. So I think an immediate fix is to make this settable (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM efficiency you have, meaning the worse indexing throughput. So you shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state into DocumentsWriter. Today, a single docID stream is assigned across threads (eg one thread gets docID=0, next one docID=1, etc.), and each thread writes to a private RAM buffer (living in the thread state), and then on flush we do a merge sort. The merge sort is inefficient (does not currently use a PQ)... and, wasteful because we must re-decode every posting byte.
> I think we could change this, so that threads write to private RAM buffers, with a private docID stream, but then instead of merging on flush, we directly flush each thread as its own segment (and, allocate private docIDs to each thread). We can then leave merging to CMS which can already run merges in the BG without blocking ongoing indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states. Ie, we need not flush all thread states at once -- we can flush one when it gets too big, and then let the others keep running. This should be a good concurrency gain since it uses IO & CPU resources "throughout" indexing instead of "big burst of CPU only" then "big burst of IO only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less monotonic", meaning if N threads are indexing, you'll roughly get in-time-order assignment of docIDs. But with this change, all of one thread state would get 0..N docIDs, the next thread state'd get N+1...M docIDs, etc. However, a single thread would still get monotonic assignment of docIDs.
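The per-writer int[] mapping discussed above can be sketched as follows. This is an illustrative model (class and method names are assumptions, not Lucene code): each DW records deletes against its own private docIDs, and those only map into the global docID space once the buffer is flushed as a segment with a known base.

```java
import java.util.Arrays;

// Sketch (assumed names): each per-thread writer keeps deletes keyed by its
// own private docIDs; on flush those are rebased onto the global docID space
// by adding the new segment's base.
class PerWriterDeletes {
    final int[] deletedLocalIds;   // local docIDs marked deleted in this buffer

    PerWriterDeletes(int[] ids) {
        this.deletedLocalIds = ids.clone();
    }

    // global docIDs once this writer's buffer is flushed at the given base
    int[] rebase(int segmentBase) {
        int[] global = new int[deletedLocalIds.length];
        for (int i = 0; i < global.length; i++) {
            global[i] = segmentBase + deletedLocalIds[i];
        }
        return global;
    }
}
```

With the numbers from the example: DW1's deleted local ID 0 rebases to 5 once DW1 flushes at base 5, and DW2's local IDs 0,1 rebase to 7,8 at base 7.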
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841915#action_12841915 ] Michael Busch commented on LUCENE-2293:
---
{quote}
This is a great approach for speeding up NRT - NRT readers will no longer have to flush. It's similar in spirit to LUCENE-1313, but that issue is still flushing segments (but, into an intermediate RAMDir).
{quote}

I agree! Thinking further about this: each (re)opened RAM segment reader also needs to remember the maxDoc of the corresponding DW at the time it was (re)opened. This way we can prevent a RAM reader from reading posting lists beyond that maxDoc, even if the writer thread keeps building the lists in parallel. This allows us to guarantee the point-in-time requirements.

Also, the PostingList objects we store in the TermHash already contain a lastDocID (if I remember correctly). So when a RAM reader's termEnum iterates the dictionary, it can skip all terms where term.lastDocID > RAMReader.maxDoc. It's quite neat that all we then have to do in reopen is update ramReader.maxDoc and ramReader.seqID.

Of course one big thing is still missing: keeping the term dictionary sorted. In order to implement the full IndexReader interface, specifically TermEnum, it's necessary to give each RAM reader a point-in-time sorted dictionary - at least in one direction, as a TermEnum only seeks forward. I think we have two options here: either we try to keep the dictionary always sorted, whenever a term is added - I guess then we'd have to implement a b-tree or something similar? The second option I can think of is to add a "nextTerm" pointer to TermHash.PostingList, which allows us to build up a linked list across all terms. When a ramReader is opened we would sort all terms - not by changing their position in the hash, but by building the singly-linked list in sorted order. When a new reader gets (re)opened we need to merge-sort the new terms into the linked list.
I guess it's easy to implement this lock-free. E.g. if you have the linked list a->c and you want to add b in the middle, you set b->c before changing a->c to a->b. Then it's undefined whether an in-flight older reader sees term b. The old reader must not return b, since b was added after the old reader was (re)opened. So either case is fine: either it doesn't see b because the link wasn't updated yet, or it sees it but doesn't return it, because b.lastDocID > ramReader.maxDoc.

The downside is that we will have to pay the price of sorting in reader.reopen, which however should be cheap if readers are reopened frequently. I'm not sure, though, whether this linked-list approach is more or less compelling than something like a b-tree.

Btw: shall we open a new "searchable DW buffer" issue, or continue using this issue for these discussions?
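The ordering trick described above (set b->c before swinging a's pointer to b) plus the lastDocID visibility check can be sketched like this. The names (TermNode, TermList) are illustrative, not the actual TermHash code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the linked-list term dictionary idea: a new term is linked in by
// first pointing it at its successor, then swinging the predecessor; a reader
// that races the second write simply skips the term via the lastDocID check.
class TermNode {
    final String term;
    final int lastDocID;       // last doc in the buffer containing this term
    volatile TermNode next;

    TermNode(String term, int lastDocID) {
        this.term = term;
        this.lastDocID = lastDocID;
    }
}

class TermList {
    static void insertAfter(TermNode pred, TermNode node) {
        node.next = pred.next;  // b -> c first ...
        pred.next = node;       // ... then a -> b
    }

    // terms visible to a reader opened at the given maxDoc
    static List<String> visible(TermNode head, int maxDoc) {
        List<String> out = new ArrayList<>();
        for (TermNode n = head; n != null; n = n.next) {
            if (n.lastDocID <= maxDoc) out.add(n.term);  // skip too-new terms
        }
        return out;
    }
}
```

An older reader opened at maxDoc=1 never returns a term first seen in doc 5, even if it observes the new link, because that term's lastDocID exceeds the reader's maxDoc.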
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841745#action_12841745 ] Michael Busch commented on LUCENE-2293:
---
{quote}
Won't this complicate the entire solution? What I liked about keeping each DW separate (and call it SegmentWriter) is that it really operates on its own. When a delete happens on IW, it is synced so that it could be registered on all DWs. But besides that, the DWs don't know about each other nor care. Code should be really simple that way - the only thing that will be shared is the pool of buffers.
{quote}

What I'm proposing is no different and doesn't make things more complicated. Either way, you have to apply all deletes on all DWs, because you delete by query or term. This might not be the right time for this proposal, because it'll only work with searchable DW buffers. But I wanted to mention the idea already, so that we can keep it in mind. And hopefully we can work on searchable DW buffers soon.

{quote}
but does anyone out there wanna work out the "private RAM segments"?
{quote}

I would like to help, but I'm likely not going to have enough time right now to write an entire patch for this big change myself.
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841744#action_12841744 ] Michael Busch commented on LUCENE-2293:
---
{quote}
But if each DW maintains its own doc IDs, separately from the others, what will be stored in the int[]? DW1 deleted docID 0 (its 0) and DW4 deleted the same. The two documents are not the same one ... no?
{quote}

In DW you don't delete by docID; you can only delete by term or query. You have to run the (term) query in all DWs to determine whether any of them has one or more matching docs that have to be deleted.

Today the queries and/or terms are buffered, along with the maxDocID at the time the delete or update was called. They are applied just after the DW buffer has been flushed to a segment, because that's the first time the docs are searchable and the delete queries can be executed. In the future, when we can search the DW buffer(s), we can apply the deletes right away.

Using this int[] approach for deletes avoids having to clone bitsets on each reopen.
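The buffering described above - remembering each delete term together with the docID high-water mark at the time the delete arrived, and applying it at flush time - can be sketched as below. This BufferedDeletes is a simplified illustrative stand-in, not Lucene's actual class:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of buffered-deletes bookkeeping: each delete term is remembered with
// the buffer's docID high-water mark, so at flush time it is only applied to
// docs that were indexed before the delete arrived.
class BufferedDeletes {
    private final Map<String, Integer> termToMaxDocId = new HashMap<>();

    void bufferDelete(String term, int currentMaxDocId) {
        termToMaxDocId.put(term, currentMaxDocId);
    }

    // at flush time: should this (term, docId) posting be deleted?
    boolean applies(String term, int docId) {
        Integer upto = termToMaxDocId.get(term);
        return upto != null && docId < upto;
    }
}
```

Docs added after the delete was buffered (docId >= the recorded high-water mark) survive, which is the point-in-time behavior the comment describes.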
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841617#action_12841617 ] Michael Busch commented on LUCENE-2293:
---
bq. The big advantage is that all (re)opened readers can share the single int[] array.

Dirty reads will be a problem when sharing the array. An AtomicIntegerArray could be used; we'd need to experiment to see how expensive that would be.
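A sketch of the AtomicIntegerArray variant mentioned above, assuming the seqID-based deletes array discussed elsewhere in this thread (SharedDeletes is an illustrative name, not a Lucene class):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: replacing the plain shared int[] with an AtomicIntegerArray gives
// safe publication between the writer thread and concurrently opened readers,
// avoiding the dirty reads the comment worries about.
class SharedDeletes {
    static final int NOT_DELETED = Integer.MAX_VALUE;
    final AtomicIntegerArray deletes;

    SharedDeletes(int maxDoc) {
        deletes = new AtomicIntegerArray(maxDoc);
        for (int i = 0; i < maxDoc; i++) deletes.set(i, NOT_DELETED);
    }

    // record the seqID at which the doc was deleted; CAS keeps the first one
    void delete(int doc, int seqID) {
        deletes.compareAndSet(doc, NOT_DELETED, seqID);
    }

    // a reader sees the doc as deleted only if it was deleted before the
    // reader was (re)opened
    boolean isDeleted(int doc, int readerSeqID) {
        return deletes.get(doc) < readerSeqID;
    }
}
```

compareAndSet keeps the oldest seqID if two threads race to delete the same doc, and the volatile semantics of get/set fix the unsafe publication that makes dirty reads possible with a plain int[].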
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841545#action_12841545 ] Michael Busch commented on LUCENE-2293:
---
{quote}
I thought that when (3) happens, the delete-by-term needs to be issued against all DWs, so that later when they apply their deletes they'll remember to do so. Issuing that against all DWs will record the docID of each DW up until which the delete should apply.
{quote}

Yes, you still need to apply deletes on all DWs. My approach is no different in that regard.

{quote}
Also, I don't see the advantage of moving to store the deletes in int[] rather than bitset ... is it just to avoid calling the get(doc)?
{quote}

The big advantage is that all (re)opened readers can share the single int[] array. If you use a bitset, you need to clone it for each reader. With the int[], reopening becomes basically free from a deletes perspective.
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841407#action_12841407 ] Michael Busch commented on LUCENE-2293:
---
bq. Yes, I think each DW will have to record its own buffered delete Term/Query, mapping to its docID at the time the delete arrived.

I think in the future deletes in DW could work like this:
- DW of course keeps track of a private sequence ID, which gets incremented in the add, delete, and update calls
- DW has a getReader() call; the reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current seqID at the time it was opened - let's call it RAMReader.seqID; if such a reader gets reopened, simply its seqID gets updated
- we keep a growing int array with the size of DW's maxDoc, which replaces the usual deletes bitset
- when DW.updateDocument() or .deleteDocument() needs to delete a doc, we do that right away, before inverting the new doc. We can do that by running a query using a RAMReader to find all docs that must be deleted. Instead of flipping a bit in a bitset, for each hit we now keep track of when it was deleted:

{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
...
Arrays.fill(deletes, NOT_DELETED);
...
public void deleteDocument(Query q) {
  // reopen RAMReader, run query q using RAMReader,
  // then for each hit:
  int hitDocId = ...
  if (deletes[hitDocId] == NOT_DELETED) {
    deletes[hitDocId] = DW.seqID;
  }
  ...
  DW.seqID++;
}
{code}

Now no matter how often you (re)open RAMReaders, they can share the deletes array. No cloning like with the BitSet approach would be necessary: when the RAMReader iterates posting lists, it's as simple as this to treat deleted docs correctly. Instead of doing this in RAMTermDocs.next():

{code}
if (deletedDocsBitSet.get(doc)) {
  // skip this doc
}
{code}

we can now do:

{code}
if (deletes[doc] < ramReader.seqID) {
  // skip this doc
}
{code}

Here is an example:
1. Add 3 docs with DW.addDocument()
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b

After 1: DW.seqID = 2; deletes[] = {MAX_VALUE, MAX_VALUE, MAX_VALUE}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[] = {MAX_VALUE, 2, MAX_VALUE}
After 4: ramReader_b.seqID = 3

Note that both ramReader_a and ramReader_b share the same deletes[] array. Now when ramReader_a is used to read posting lists, it will not treat doc 1 as deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false; but ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) = (2 < 3) = true.

What do you think about this approach for the future, when we have a searchable DW buffer?
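The scheme in the comment above can be replayed as a small runnable sketch. It is simplified: document adding is elided and the seqID is set directly; class and method names are assumptions, not Lucene code:

```java
import java.util.Arrays;

// Sketch of the shared-deletes-array-with-seqIDs scheme: the writer records
// the seqID at which each doc was deleted; every reader shares the same array
// and decides visibility with its own remembered seqID.
class SeqIdDeletesDemo {
    static final int NOT_DELETED = Integer.MAX_VALUE;
    final int[] deletes;
    int seqID;

    SeqIdDeletesDemo(int maxDoc) {
        deletes = new int[maxDoc];
        Arrays.fill(deletes, NOT_DELETED);
    }

    // "opening" a reader just means remembering the current seqID
    int openReader() { return seqID; }

    void delete(int doc) {
        if (deletes[doc] == NOT_DELETED) deletes[doc] = seqID;
        seqID++;
    }

    // deleted for this reader only if the delete happened before it opened
    boolean isDeleted(int doc, int readerSeqID) {
        return deletes[doc] < readerSeqID;
    }
}
```

Opening a reader is just an int copy, which is why reopen is "basically free from a deletes perspective" in this design.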
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841388#action_12841388 ] Michael Busch commented on LUCENE-2293: --- {quote} But, I was proposing a bigger change (call it "private RAM segments"): there would be multiple DWs, each one writing to its own private RAM segment (each one getting private docID assignment) and its own doc stores. {quote} Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I like it. > IndexWriter has hard limit on max concurrency > - > > Key: LUCENE-2293 > URL: https://issues.apache.org/jira/browse/LUCENE-2293 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1 > > > DocumentsWriter has this nasty hardwired constant: > {code} > private final static int MAX_THREAD_STATE = 5; > {code} > which probably I should have attached a //nocommit to the moment I > wrote it ;) > That constant sets the max number of thread states to 5. This means, > if more than 5 threads enter IndexWriter at once, they will "share" > only 5 thread states, meaning we gate CPU concurrency to 5 running > threads inside IW (each thread must first wait for the last thread to > finish using the thread state before grabbing it). > This is bad because modern hardware can make use of more than 5 > threads. So I think an immediate fix is to make this settable > (expert), and increase the default (8?). > It's tricky, though, because the more thread states, the less RAM > efficiency you have, meaning the worse indexing throughput. So you > shouldn't up and set this to 50: you'll be flushing too often. > But... I think a better fix is to re-think how threads write state > into DocumentsWriter. 
Today, a single docID stream is assigned across > threads (eg one thread gets docID=0, next one docID=1, etc.), and each > thread writes to a private RAM buffer (living in the thread state), > and then on flush we do a merge sort. The merge sort is inefficient > (does not currently use a PQ)... and, wasteful because we must > re-decode every posting byte. > I think we could change this, so that threads write to private RAM > buffers, with a private docID stream, but then instead of merging on > flush, we directly flush each thread as its own segment (and, allocate > private docIDs to each thread). We can then leave merging to CMS > which can already run merges in the BG without blocking ongoing > indexing (unlike the merge we do in flush, today). > This would also allow us to separately flush thread states. Ie, we > need not flush all thread states at once -- we can flush one when it > gets too big, and then let the others keep running. This should be a > good concurrency gain since it uses IO & CPU resources "throughout" > indexing instead of "big burst of CPU only" then "big burst of IO > only" that we have today (flush today "stops the world"). > One downside I can think of is... docIDs would now be "less > monotonic", meaning if N threads are indexing, you'll roughly get > in-time-order assignment of docIDs. But with this change, all of one > thread state would get 0..N docIDs, the next thread state'd get > N+1...M docIDs, etc. However, a single thread would still get > monotonic assignment of docIDs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841135#action_12841135 ] Michael Busch commented on LUCENE-2293: --- Sorry - after reading my comment again I can see why it was confusing. A load balancer wasn't a very good analogy. I totally agree that Lucene should still piggyback on the application's threads and not start its own thread for document inversion. Today, as you said, the DocumentsWriter manages a certain number of thread states, has the WaitQueue, and does its own memory management. What I was thinking was that it would be simpler if each DocumentsWriter was only used by a single thread. The IndexWriter would have multiple DocumentsWriters and do the thread binding (+waitqueue). This would make the code in DocumentsWriter and the downstream classes simpler. The side effect is that each DocumentsWriter would manage its own memory. {quote} Also, I thought that each thread writes to different ThreadState does not ensure documents are written in order, but that finally when DW flushes, the different ThreadStates are merged together and one segment is written, somehow restores the orderness ... {quote} Stored fields are written to an on-disk stream (docstore) in order. The WaitQueue takes care of finishing the docs in the right order. The postings are written into TermHashes per thread state in parallel. The docIDs are in increasing order, but can have gaps. E.g. thread state 1 inverts docs 1 and 3, while thread state 2 inverts doc 2. When it's time to flush the whole buffer, these different TermHash posting lists get interleaved.
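The refactoring described in this comment - IndexWriter holding several single-threaded DocumentsWriters and doing the thread binding itself - could be pictured roughly as below. This is a hypothetical sketch, not code from any patch: the class names (SketchIndexWriter, SketchDocWriter) and the RAM threshold are invented for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of one DocumentsWriter per thread (names invented).
// Each writer owns a private RAM buffer and a private docID stream, so it
// could flush as its own segment while the other writers keep indexing.
class SketchDocWriter {
    private long ramUsed;
    private int nextDocId; // private docID stream, starts at 0 per writer

    void invert(String doc) {
        ramUsed += doc.length(); // stand-in for real inverted-index RAM accounting
        nextDocId++;
    }
    long ramUsed() { return ramUsed; }
    void flushAsPrivateSegment() { ramUsed = 0; } // would write segment + doc store
}

// IndexWriter does the thread binding; the downstream classes stay single-threaded.
class SketchIndexWriter {
    static final long MAX_RAM_PER_WRITER = 16L << 20; // 16 MB, illustrative only
    private final ConcurrentHashMap<Thread, SketchDocWriter> binding = new ConcurrentHashMap<>();

    void addDocument(String doc) {
        // The same thread always reaches the same writer, so no locking downstream.
        SketchDocWriter dw = binding.computeIfAbsent(Thread.currentThread(), t -> new SketchDocWriter());
        dw.invert(doc);
        if (dw.ramUsed() > MAX_RAM_PER_WRITER) {
            dw.flushAsPrivateSegment(); // per-writer flush; other writers are unaffected
        }
    }
    int writerCount() { return binding.size(); }

    public static void main(String[] args) {
        SketchIndexWriter iw = new SketchIndexWriter();
        iw.addDocument("doc one");
        iw.addDocument("doc two");
        System.out.println("writers bound: " + iw.writerCount());
    }
}
```

Note how a per-writer flush needs no coordination with the other writers - exactly the decoupled memory management being debated here.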
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841120#action_12841120 ] Michael Busch commented on LUCENE-2293: --- {quote} Also, in the pull approach, Lucene would introduce another place where it allocates threads. {quote} What I described is not much different from what's happening today. DocumentsWriter already has a WaitQueue that ensures that the docs are written in the right order. I simply tried to suggest a way to refactor our classes... functionally the same as what Mike suggested. I shouldn't have said "pulled from" (the queue).
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840952#action_12840952 ] Michael Busch commented on LUCENE-2293: --- bq. I hope we won't lose monotonic docIDs for a single-threaded indexation somewhere along that path. No. The order in the single-threaded case won't be different from today with the changes Mike is proposing.
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840911#action_12840911 ] Michael Busch commented on LUCENE-2293: --- Good timing - a couple of days ago I was thinking about how threading could be changed in the indexer. The other downside is that you would have to buffer deleted docs and queries separately for each thread state, because you have to keep the private docID? So that would need a bit more memory. Couldn't we make the DocumentsWriter and all related down-stream classes single-threaded then? The IndexWriter (or a new class) would have the doc queue, basically a load balancer, that multiple DocumentsWriter instances would pull from as soon as they are done inverting the previous document? This would allow us to simplify the indexer chain a lot - we could get rid of all the *PerThread classes. We'd then also have to separate the docstores from the DocumentsWriter, so that multiple DocumentsWriter instances could share them (which I'd like to do for LUCENE-2026 anyway).
[jira] Updated: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2126: -- Attachment: lucene-2126.patch Updated patch to trunk. I'll have to make a change to the backwards-tests too, because moving the copyBytes() method from IndexOutput to DataOutput and changing its parameter from IndexInput to DataInput breaks drop-in compatibility. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch, lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
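The proposed split can be pictured roughly as below. This is a simplified sketch, not the actual patch: the VInt scheme mirrors Lucene's 7-bits-per-byte encoding, copyBytes() moves down to DataOutput and takes a DataInput parameter as the comment describes, and the in-memory BytesOut/BytesIn classes are invented here for demonstration.

```java
// Simplified sketch of the proposed split. Encode/decode methods live in the
// new base classes; only file-oriented methods (seek, close, getFilePointer)
// would remain in IndexInput/IndexOutput.
abstract class DataOutput {
    public abstract void writeByte(byte b);

    // Lucene-style VInt: 7 payload bits per byte, high bit means "more bytes follow".
    public void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        writeByte((byte) i);
    }

    // copyBytes() moves down here, and its parameter becomes a DataInput.
    public void copyBytes(DataInput in, long numBytes) {
        for (long j = 0; j < numBytes; j++) writeByte(in.readByte());
    }
}

abstract class DataInput {
    public abstract byte readByte();

    public int readVInt() {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// Tiny in-memory implementations, for demonstration only.
class BytesOut extends DataOutput {
    final java.io.ByteArrayOutputStream buf = new java.io.ByteArrayOutputStream();
    public void writeByte(byte b) { buf.write(b); }
}

class BytesIn extends DataInput {
    private final byte[] bytes;
    private int pos;
    BytesIn(byte[] bytes) { this.bytes = bytes; }
    public byte readByte() { return bytes[pos++]; }
}

class SplitDemo {
    public static void main(String[] args) {
        BytesOut out = new BytesOut();
        out.writeVInt(300); // 300 needs two VInt bytes
        BytesIn in = new BytesIn(out.buf.toByteArray());
        System.out.println(in.readVInt());
    }
}
```

With this shape, classes like ByteSliceReader/Writer can extend DataInput/DataOutput directly, which is exactly why the patch no longer needs them to throw RuntimeExceptions from file-oriented methods.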
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795964#action_12795964 ] Michael Busch commented on LUCENE-2126: --- There has been silence here, so I hope everyone is ok with this change now? I'll commit this in a day or two if nobody objects!
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795963#action_12795963 ] Michael Busch commented on LUCENE-2186: --- Great to see progress here, Mike! {quote} String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). {quote} It looks like BytesRef is very similar to Payload? Could you use that instead and extend it with the new String constructor and compare methods? {quote} It handles 3 types of values: {quote} So it looks like with your approach you want to support certain "primitive" types out of the box, such as byte[], float, int, String? If someone has custom data types, then they have, similar to payloads today, the byte[] indirection? The code I initially wrote for 1231 exposed IndexOutput, so that one can call write*() directly, without having to convert to byte[] first. I think we will also want to do that for 2125 (store attributes in the index). So I'm wondering if this and 2125 should work similarly? Thinking out loud: Could we then have attributes with serialize/deserialize methods for primitive types, such as float? Could we efficiently use such an approach all the way up to FieldCache? It would be compelling if you could store an attribute as CSF, or in the posting list, retrieve it from the flex APIs, and also from the FieldCache. All would be the same API and there would only be one place that needs to "know" about the encoding (the attribute). {quote} Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. {quote} Yeah, that's where I got kind of stuck with 1231: We need to figure out how the public API, with which a user can add CSF values to the index and retrieve them, should look. The easiest and fastest way would be to add a dedicated new API.
The cleaner one would be to make the whole Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt. {quote} There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) {quote} Hmm, so random-access would obviously be the preferred approach for SSDs, but with conventional disks I think the performance would be poor? In 1231 I implemented the var-sized CSF with a skip list, similar to a posting list. I think we should add that here too and we can still keep the additional index that stores the pointers? We could have two readers: one that allows random-access and loads the pointers into RAM (or uses MMAP as you mentioned), and a second one that doesn't load anything into RAM, uses the skip lists and only allows iterator-based access? About updating CSF: I hope we can use parallel indexing for that. In other words: It should be possible for users to use parallel indexes to update certain fields, and Lucene should use the same approach internally to store different "generations" of things like norms and CSFs. > First cut at column-stride fields (index values storage) > > > Key: LUCENE-2186 > URL: https://issues.apache.org/jira/browse/LUCENE-2186 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2186.patch > > > I created an initial basic impl for storing "index values" (ie > column-stride value storage). This is still a work in progress... but > the approach looks compelling. 
I'm posting my current status/patch > here to get feedback/iterate, etc. > The code is standalone now, and lives under new package > oal.index.values (plus some util changes, refactorings) -- I have yet > to integrate into Lucene so eg you can mark that a given Field's value > should be stored into the index values, sorting will use these values > instead of field cache, etc. > It handles 3 types of values: > * Six variants of byte[] per doc, all combinations of fixed vs > variable length, and stored either "straight" (good for eg a > "title" field), "deref" (good when many docs share the same value, > but you won't do any sorting) or "sorted". > * Integers (variable bit precision used as necessary, ie this can > store byte/short/int/long, and all precisions in between) > * Floats (4 or 8 byte precision) > String fields are stored as the UTF8 byte[]. This patch adds a > Bytes
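The "attributes with serialize/deserialize methods" idea floated in the comment above could be sketched as follows. This is purely illustrative - BoostAttribute and its methods are hypothetical, and Java's java.io.DataInput/DataOutput stand in for the Lucene classes discussed in LUCENE-2126.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical attribute: the attribute alone knows its wire format, so a
// codec, a CSF writer, or the FieldCache could all reuse the same encoding.
class BoostAttribute {
    private float boost = 1.0f;

    void setBoost(float b) { boost = b; }
    float getBoost() { return boost; }

    // Callers just hand the attribute a stream; no byte[] indirection.
    void serialize(DataOutput out) throws IOException { out.writeFloat(boost); }
    void deserialize(DataInput in) throws IOException { boost = in.readFloat(); }
}

// Round-trip helper showing that the encoding lives in exactly one place.
class AttrDemo {
    static float roundTrip(float boost) {
        try {
            BoostAttribute written = new BoostAttribute();
            written.setBoost(boost);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            written.serialize(new DataOutputStream(bytes));
            BoostAttribute read = new BoostAttribute();
            read.deserialize(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return read.getBoost();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(AttrDemo.roundTrip(2.5f));
    }
}
```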
[jira] Commented: (LUCENE-2182) DEFAULT_ATTRIBUTE_FACTORY fails to load implementation class when interface comes from different classloader
[ https://issues.apache.org/jira/browse/LUCENE-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794830#action_12794830 ] Michael Busch commented on LUCENE-2182: --- Looks like a good solution! Thanks for taking care of this, Uwe! {quote} Should we backport this to 2.9 and 3.0 (which is easy)? {quote} +1 > DEFAULT_ATTRIBUTE_FACTORY fails to load implementation class when interface > comes from different classloader > > > Key: LUCENE-2182 > URL: https://issues.apache.org/jira/browse/LUCENE-2182 > Project: Lucene - Java > Issue Type: Bug > Components: Other >Affects Versions: 2.9.1, 3.0 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2182.patch > > > This is a followup for > [http://www.lucidimagination.com/search/document/1724fcb3712bafba/using_the_new_tokenizer_api_from_a_jar_file]: > The DEFAULT_ATTRIBUTE_FACTORY should load the implementation class for a > given attribute interface from the same classloader as the attribute > interface. The current code loads it from the classloader of the > lucene-core.jar file. In Solr this fails when the interface is in a JAR file > coming from the plugins folder. > The interface is loaded correctly, because > addAttribute(FooAttribute.class) loads the FooAttribute.class from the plugin > code, and this succeeds. But as addAttribute tries to load the class from > its local lucene-core.jar classloader it will not find the attribute. > The fix is to tell Class.forName to use the classloader of the corresponding > interface, which is the correct way to handle it, as the impl and the > attribute should always be in the same classloader and file. > I hope I can somehow add a test for that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789946#action_12789946 ] Michael Busch commented on LUCENE-2126: --- {quote} So first, can we perhaps name them otherwise, like LuceneInput/Output or something similar, to not confuse w/ Java's? {quote} Hmm, I was a bit concerned about confusion first too. But I'm, like Mark, not really liking LuceneInput/Output. I'd personally be ok with keeping DataInput/Output. But maybe we can come up with something better? Man, naming is always so hard... :) {quote} Second, why not have them implement Java's DataInput/Output, and add on top of them additional methods, like readVInt(), readVLong() etc.? {quote} I considered that, but Java's interfaces dictate what string encoding to use: (From java.io.DataInput's javadocs) {noformat} Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. {noformat} E.g. DataInput defines readChar(), which we'd have to implement. But in IndexInput we deprecated readChars(), because we don't want to use modified UTF-8 anymore.
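The modified-UTF-8 incompatibility cited above is easy to demonstrate with the JDK itself: java.io.DataOutputStream.writeUTF() (the java.io.DataOutput contract) encodes U+0000 as the two bytes 0xC0 0x80 after a 2-byte length prefix, whereas standard UTF-8 uses a single 0x00 byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // java.io.DataOutput's writeUTF(): 2-byte length prefix + modified UTF-8 body.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+0000: modified UTF-8 body is 0xC0 0x80 (two bytes); standard UTF-8 is one 0x00 byte.
        byte[] modified = modifiedUtf8("\0");
        byte[] standard = standardUtf8("\0");
        System.out.println(modified.length + " bytes vs " + standard.length + " byte");
    }
}
```

So implementing java.io.DataInput/DataOutput would lock Lucene into an encoding it has deliberately moved away from.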
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789944#action_12789944 ] Michael Busch commented on LUCENE-2126: --- {quote} What does a "normal" user do with a file? Step 1: Open the file. Step 2: Write data to the file. Step 3: Close the file. Then, later... Step 1: Open the file. Step 2: Read data from the file. Step 3: Close the file. You're saying that Lucene's file abstraction is easier to understand if you break that up? {quote} No, I'm saying "normal" users do not work directly with files, so they won't do any of your steps above. They don't need to know those I/O related classes (except Directory). DataInput/Output is about encoding/decoding of data, which is all a user of 2125 needs to worry about. The user doesn't have to know that the attribute is first serialized into byte slices in TermsHashPerField and then written into the file(s) the actual codec defines. {quote} But the idea that this strange fragmentation of the IO hierarchy makes things easier - I don't get it at all. And I certainly don't see how it's such an improvement over what exists now that it justifies a change to the public API. {quote} It makes it easier for a 2125 user. It does not make it harder for someone "advanced" who's dealing with IndexInput/Output already. It also makes it cleaner - look e.g. at ByteSliceReader/Writer: those classes currently just throw RuntimeExceptions in the methods that this patch leaves in IndexInput/Output. Why? Because they're not dealing with file I/O, but with data (de)serialization.
> Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
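To make the proposed division of labor concrete, here is a minimal, self-contained sketch (illustrative class names only, not Lucene's actual classes): the encoding/decoding logic, such as Lucene-style VInts, lives once in DataOutput/DataInput, while subclasses only supply the raw byte sink or source:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of the split: DataOutput/DataInput own the
// encoding/decoding logic; subclasses only supply raw bytes.
abstract class DataOutput {
    public abstract void writeByte(byte b) throws IOException;

    // Writes an int in 7 bits per byte, low bytes first; the high bit
    // of each byte marks "more bytes follow".
    public void writeVInt(int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        writeByte((byte) i);
    }
}

abstract class DataInput {
    public abstract byte readByte() throws IOException;

    public int readVInt() throws IOException {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// The file/stream concerns live in the subclasses, not in Data*.
class StreamDataOutput extends DataOutput {
    private final OutputStream out;
    StreamDataOutput(OutputStream out) { this.out = out; }
    public void writeByte(byte b) throws IOException { out.write(b); }
}

class StreamDataInput extends DataInput {
    private final InputStream in;
    StreamDataInput(InputStream in) { this.in = in; }
    public byte readByte() throws IOException { return (byte) in.read(); }
}

public class Main {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new StreamDataOutput(bytes).writeVInt(300);
        System.out.println(bytes.size());  // 300 needs two VInt bytes: prints 2
        DataInput in = new StreamDataInput(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(in.readVInt()); // prints 300
    }
}
```

A real IndexInput/IndexOutput would additionally carry seek(), close(), getFilePointer(), etc.; the point of the patch is that ByteSliceReader/Writer-style classes only ever need the Data* half.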
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789834#action_12789834 ] Michael Busch commented on LUCENE-2126: --- I disagree with you here: introducing DataInput/Output actually makes the API easier for the "normal" user to understand, IMO. I would think that most users don't implement IndexInput/Output extensions, but simply use the out-of-the-box Directory implementations, which provide IndexInput/Output impls. Also, most users probably don't even call the IndexInput/Output APIs directly. {quote} Do nothing and assume that the sort of advanced user who writes a posting codec won't do something incredibly stupid like call indexInput.close(). {quote} Writing a posting codec is much more advanced than using 2125's features. Ideally, a user who simply wants to store some specific information in the posting list, such as a boost, a part-of-speech identifier, another VInt, etc., should with 2125 only have to implement a new attribute, including the serialize()/deserialize() methods. People who want to do that don't need to know anything about Lucene's API layer. They only need to know the APIs that DataInput/Output provide, and will not get confused by methods like seek() or close(). For the standard user who only wants to write such an attribute, it should not matter what Lucene's IO structure looks like - so even if we make changes that go in Lucy's direction in the future (IndexInput/Output owning a filehandle vs. the need to extend them), the serialize()/deserialize() methods of an attribute would still work with DataInput/Output. I bet that a lot of people who used the payload feature before took a ByteArrayOutputStream together with a DataOutputStream (which implements Java's DataOutput) to populate the payload byte array. With 2125 Lucene will provide an API that is similarly easy to use, but more efficient, as it removes the byte[] indirection and overhead. 
I'm still +1 for this change. Others?
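The pre-2125 payload workaround described above can be sketched with plain java.io classes (illustrative method names; the byte[] produced here is what would be handed to Lucene as the payload, and it is exactly this intermediate array that 2125 aims to eliminate):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// The pre-2125 pattern: encode per-position data into a byte[] via
// java.io.DataOutputStream, hand the array to Lucene as a payload,
// and decode it again at search time. Every posting pays for the
// extra byte[] allocation and copy.
public class Main {
    static byte[] encodePayload(int boost, int partOfSpeech) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(boost);
        out.writeByte(partOfSpeech);
        return bytes.toByteArray();  // this array becomes the payload
    }

    static int[] decodePayload(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        return new int[] { in.readInt(), in.readUnsignedByte() };
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = encodePayload(42, 7);
        int[] decoded = decodePayload(payload);
        // 4 bytes for the int + 1 for the byte: prints "5 42 7"
        System.out.println(payload.length + " " + decoded[0] + " " + decoded[1]);
    }
}
```

With 2125, an attribute's serialize() method would write the same values straight to Lucene's DataOutput, with no intermediate array.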
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788001#action_12788001 ] Michael Busch commented on LUCENE-2126: --- The main reason why I'd like to separate DataInput/Output from IndexInput/Output now is LUCENE-2125. Users should be able to implement methods that serialize/deserialize attributes into/from a posting list. These methods should only be able to call the read/write methods (which this issue moves to DataInput/Output), but not methods like close(), seek(), etc. Thanks for spending time reviewing this and giving feedback from Lucy land, Marvin! I think I will go ahead and commit this, and once we see a need to allow users to extend DataInput/Output outside of Lucene we can make the additional changes mentioned in your and my comments here. So I will commit this tomorrow if nobody objects.
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787180#action_12787180 ] Michael Busch commented on LUCENE-2126: --- Thanks for the input, Marvin. I can see the advantages of what you're proposing. With this patch, all IndexInput/IndexOutput implementations can only benefit from a new encoding/decoding method if it is added to the DataInput/Output classes directly, which is only possible by changing the classes in Lucene, not from outside. The problem here, as so often, is backwards-compat. This patch has no problems in that regard, as we just move the methods into new superclasses. If we wanted to implement what Lucy is doing, we'd have to deprecate all encoding/decoding methods in IndexInput/Output and add them to DataInput/Output. Then a DataInput would not be the superclass of IndexInput, but rather *have* an IndexInput. All users who call any of the encoding/decoding methods directly on IndexInput/Output would have to change their code to use the new classes. So I can certainly see the benefits; the question now is whether they're currently important enough to justify the backwards-compat hassle.
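The two designs under discussion can be contrasted in a small sketch (illustrative stubs under assumed names, not Lucene's or Lucy's real classes): inheritance keeps existing indexInput.readVInt() callers compiling, while composition writes each codec method once for every byte source:

```java
// (a) What this patch does: inheritance. IndexInput *is a* DataInput,
// so code that calls readVInt() on an IndexInput keeps compiling.
abstract class DataInput {
    public abstract byte readByte();

    public int readVInt() {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

abstract class IndexInput extends DataInput {  // file-level concerns stay here
    public abstract long getFilePointer();
    public abstract void seek(long pos);
    public abstract void close();
}

// (b) Lucy-style composition: a single concrete DataInput *has a*
// byte source. A new encoding method added here is picked up by every
// source at once -- but indexInput.readVInt() callers would break.
interface ByteSource { byte readByte(); }

final class ComposedDataInput {
    private final ByteSource source;
    ComposedDataInput(ByteSource source) { this.source = source; }

    public int readVInt() {  // same decoding, written once for all sources
        byte b = source.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = source.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

public class Main {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xAC, 0x02 };  // VInt encoding of 300
        int[] pos = { 0 };
        ComposedDataInput in = new ComposedDataInput(() -> data[pos[0]++]);
        System.out.println(in.readVInt());  // prints 300
    }
}
```

Design (a) is source- and binary-compatible with existing callers, which is the backwards-compat argument made in the comment above; design (b) is what the deprecation path would eventually lead to.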
[jira] Commented: (LUCENE-2125) Ability to store and retrieve attributes in the inverted index
[ https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786855#action_12786855 ] Michael Busch commented on LUCENE-2125: --- {quote} BTW probably the attribute should include a "merge" operation, somehow, to be efficient (simply byte[] copying instead of decode/encode) in the merge case. {quote} Yes, and then I can also close LUCENE-1585! :) > Ability to store and retrieve attributes in the inverted index > -- > > Key: LUCENE-2125 > URL: https://issues.apache.org/jira/browse/LUCENE-2125 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > > Now that we have the cool attribute-based TokenStream API and also the > great new flexible indexing features, the next logical step is to > allow storing the attributes inline in the posting lists. Currently > this is only supported for the PayloadAttribute. > The flex search APIs already provide an AttributeSource, so there will > be a very clean and performant symmetry. It should be seamlessly > possible for the user to define a new attribute, add it to the > TokenStream, and then retrieve it from the flex search APIs. > What I'm planning to do is to add additional methods to the token > attributes (e.g. by adding a new class TokenAttributeImpl, which > extends AttributeImpl and is the super class of all impls in > o.a.l.a.tokenattributes): > - void serialize(DataOutput) > - void deserialize(DataInput) > - boolean storeInIndex() > The indexer will only call the serialize method of a > TokenAttributeImpl in case its storeInIndex() returns true. > The big advantage here is the ease-of-use: A user can implement in one > place everything necessary to add the attribute to the index. > Btw: I'd like to introduce DataOutput and DataInput as super classes > of IndexOutput and IndexInput. 
They will contain methods like > readByte(), readVInt(), etc., but methods such as close(), > getFilePointer(), etc. will stay in the subclasses IndexInput/IndexOutput. > Currently the payload concept is hardcoded in > TermsHashPerField and FreqProxTermsWriterPerField. These classes take > care of copying the contents of the PayloadAttribute over into the > intermediate in-memory postinglist representation and reading it > again. Ideally these classes should not know about specific > attributes, but only call serialize() on those attributes that shall > be stored in the posting list. > We also need to change the PositionsEnum and PositionsConsumer APIs to > deal with attributes instead of payloads. > I think the new codecs should all support storing attributes. Only the > preflex one should be hardcoded to only take the PayloadAttribute into > account. > We'll possibly need another extension point that allows us to influence > compression across multiple postings. Today we use the > length-compression trick for the payloads: if the previous payload had > the same length as the current one, we don't store the length > explicitly again, but only set a bit in the shifted position VInt. Since > often all payloads of one posting list have the same length, this > results in effective compression. > Now an advanced user might want to implement a similar encoding, where > it's not enough to just control serialization of a single value, but > where e.g. the previous position can be taken into account to decide > how to encode a value. > I'm not sure yet what this extension point should look like. Maybe the > flex APIs are actually already sufficient. > One major goal of this feature is performance: It ought to be more > efficient to e.g. define an attribute that writes and reads a single > VInt than storing that VInt as a payload. The payload has the overhead > of converting the data into a byte array first. 
An attribute on the other > hand should be able to call 'int value = dataInput.readVInt();' directly > without the byte[] indirection. > After this part is done I'd like to use a very similar approach for > column-stride fields.
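A rough sketch of how the proposed attribute hooks might look, using java.io.DataInput/DataOutput as stand-ins for the new Lucene classes; IndexableAttribute, PartOfSpeechAttribute, and all signatures here are hypothetical illustrations of the description above, not actual API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical version of the serialize()/deserialize()/storeInIndex()
// hooks described in the issue. java.io.DataInput/DataOutput stand in
// for Lucene's new classes.
interface IndexableAttribute {
    boolean storeInIndex();
    void serialize(java.io.DataOutput out) throws IOException;
    void deserialize(java.io.DataInput in) throws IOException;
}

class PartOfSpeechAttribute implements IndexableAttribute {
    private int posTag;  // e.g. an id into a part-of-speech table

    void setPosTag(int tag) { this.posTag = tag; }
    int getPosTag() { return posTag; }

    public boolean storeInIndex() { return true; }  // indexer calls serialize()

    public void serialize(java.io.DataOutput out) throws IOException {
        out.writeByte(posTag);  // one byte per position, no byte[] indirection
    }

    public void deserialize(java.io.DataInput in) throws IOException {
        posTag = in.readUnsignedByte();
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        PartOfSpeechAttribute att = new PartOfSpeechAttribute();
        att.setPosTag(7);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        att.serialize(new DataOutputStream(bytes));  // what the indexer would do

        PartOfSpeechAttribute read = new PartOfSpeechAttribute();
        read.deserialize(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(read.getPosTag());  // prints 7
    }
}
```

The point of the design is that the user writes exactly this one class; the indexer and the flex search APIs call serialize()/deserialize() on the attribute's behalf, without TermsHashPerField needing to know the attribute's type.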