[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1313:
---
    Fix Version/s: (was: 2.9)
                   3.1

OK let's push it to 3.1. It's very much in progress, but 1) the iterations are slow (it's a big patch), and 2) it's a biggish change, so I'd prefer to do it shortly after a release, not shortly before, so it has plenty of time to "bake" on trunk.

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch
>
>
> Enable near realtime search in Lucene without external
> dependencies. When RAM NRT is enabled, the implementation adds a
> RAMDirectory to IndexWriter. Flushes go to the ramdir unless
> there is no available space. Merges are completed in the ram
> dir until there is no more available ram.
> IW.optimize and IW.commit flush the ramdir to the primary
> directory, all other operations try to keep segments in ram
> until there is no more space.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* The RAM buffer size is stored in the writer rather than set in DocumentsWriter, because the actual ram buffer limit in NRT changes depending on the size of the ramdir.
* NRTMergePolicy and IW.resolveRAMSegments merge all ram dir segments to primaryDir (i.e. disk) when the ramDir is over totalMax, or when any new merge would put the ramDir over totalMax.
* In DocumentsWriter we have a set limit on the buffer size, which is (tempMax - ramDirSize)/2. This keeps the total ram used under totalMax (or IW.maxBufferSize), while also keeping our temporary ram usage under the tempMax amount. When the DW.ramBuffer limit is reached, the buffer is automatically flushed to the ramDir.
* All tests pass except TestIndexWriterRAMDir.testFSDirectory. Will look into this further. When flushToRAM is on by default, there seems to be a deadlock in org.apache.lucene.TestMergeSchedulerExternal; however, when I tried to find one via jconsole by setting ANT_OPTS="-Dcom.sun.management.jmxremote", I didn't see any. I'm not sure if this is due to not connecting to the right process, or something else.
* Added testReadDocuments, which ensures we can read documents we've flushed to disk. This essentially tests our ability to simultaneously read and write documents to and from the docstore. It seemed to work on Windows.
* I think more can be done to manage the RAM more accurately; however, I think the way it works now is a good starting point.
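To make the RAM arithmetic in the bullets above concrete, here is a minimal sketch. It is not code from the patch: the class name NrtRamBudget, the method names, and the choice to floor the buffer limit at zero are assumptions made for illustration. Only the formula (tempMax - ramDirSize)/2 and the "over totalMax, or a new merge would go over totalMax" rule come from the comment.

```java
/** Hypothetical model of the RAM accounting described above; not the patch's code. */
final class NrtRamBudget {
    private final long totalMax; // cap on total RAM used (IW.maxBufferSize in the comment)
    private final long tempMax;  // cap on temporary RAM usage

    NrtRamBudget(long totalMax, long tempMax) {
        this.totalMax = totalMax;
        this.tempMax = tempMax;
    }

    /** Buffer limit from the comment: (tempMax - ramDirSize) / 2. Flooring at zero is an assumption. */
    long bufferLimit(long ramDirSize) {
        return Math.max(0L, (tempMax - ramDirSize) / 2);
    }

    /** True when RAM segments should go to disk: over totalMax now, or after the pending merge. */
    boolean mustResolveToDisk(long ramDirSize, long pendingMergeBytes) {
        return ramDirSize > totalMax || ramDirSize + pendingMergeBytes > totalMax;
    }
}
```

With totalMax = tempMax = 64 (any unit) and a ramdir holding 16, the buffer limit is (64 - 16)/2 = 24. The limit shrinks as the ramdir fills, which is why it cannot be a fixed setting on DocumentsWriter.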
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Description:
Enable near realtime search in Lucene without external dependencies. When RAM NRT is enabled, the implementation adds a RAMDirectory to IndexWriter. Flushes go to the ramdir unless there is no available space. Merges are completed in the ram dir until there is no more available ram.

IW.optimize and IW.commit flush the ramdir to the primary directory; all other operations try to keep segments in ram until there is no more space.

  was:
Realtime search with transactional semantics.

Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.

I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* All tests pass; added more tests.
* Added DocumentsWriter.growRamBufferBy/growRamDirMaxBy methods that allow dynamically requesting more ram. We start off at 50/50, ramdir/rambuffer. Then whenever one needs more, grow* is called.
* We need a RAMPolicy class that allows customizing how ram is allocated. Currently the ramdir and the rambuffer compete for space; the user will presumably want to customize this.
* I'm not sure the flushing always occurs when it should, and I'm not sure yet how to test to ensure it's flushing when it should (other than watching a log). What happened to the patch that adds logging to Lucene?
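One way to picture the 50/50 split and the grow* calls is the sketch below. The method names growRamBufferBy and growRamDirMaxBy come from the comment, but their signatures and semantics here are assumptions: this model lets one side grow only by taking headroom the other side is not currently using, which may not be what the patch does. The class name RamAllocation is hypothetical.

```java
/** Hypothetical model of a shared RAM budget split between the ram buffer and the ramdir. */
final class RamAllocation {
    private long bufferMax;
    private long ramDirMax;

    RamAllocation(long total) {
        this.bufferMax = total / 2;          // start at 50/50, as in the comment
        this.ramDirMax = total - bufferMax;
    }

    /** Assumed semantics: take unused ramdir headroom and give it to the buffer. */
    boolean growRamBufferBy(long bytes, long ramDirUsed) {
        if (bytes < 0 || ramDirMax - bytes < ramDirUsed) {
            return false; // the ramdir is using that space; the request is refused
        }
        ramDirMax -= bytes;
        bufferMax += bytes;
        return true;
    }

    /** Assumed semantics: take unused buffer headroom and give it to the ramdir. */
    boolean growRamDirMaxBy(long bytes, long bufferUsed) {
        if (bytes < 0 || bufferMax - bytes < bufferUsed) {
            return false; // the buffer is using that space; the request is refused
        }
        bufferMax -= bytes;
        ramDirMax += bytes;
        return true;
    }

    long bufferMax() { return bufferMax; }

    long ramDirMax() { return ramDirMax; }
}
```

A RAMPolicy class, as proposed above, would presumably replace the hard-coded refusal rule here with something user-configurable.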
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* A single merge scheduler is used. We will need to open a new issue for a version of ConcurrentMergeScheduler that allocates threads, perhaps based on the merge.directory? We'd also probably want to add thread pooling.
* There's a package protected IW ctor that accepts the ram dir. This is used in the test case for ensuring we aren't creating .cfs files in the ram dir.
* IW.optimize merges all segments (ram included) to the primary dir.
* IW.expungeDeletes merges segments with deletes; in-ram ones stay in ram (unless they won't fit), and primary dir ones are handled as usual.
* Added testOptimize, testExpungeDeletes, and some other test cases.
* Needs a test case to make sure we're merging to the primary dir when the ram dir is full or a flush won't fit in the ram dir.
* There's a mergeRamSegmentsToDir and a resolveRamSegments. They are two different methods because mergeRamSegmentsToDir operates by simply scheduling merges, while resolveRamSegments operates in the foreground like resolveExternalSegments. I'm not sure if we can combine the two. resolveRamSegments seems to have a thread notification problem and so hangs at times. I'll look into this further unless it's obvious what the problem is.
* When RAM NRT is on (via the IndexWriter constructor), setting the ram buffer size allocates half of the given number to the DocumentsWriter buffer and half to the ram dir. It may be best to dynamically change these numbers based on usage etc.
* Added NRTMergePolicy, which is used only when RAM NRT is on. It utilizes the regular merge policy and the ram merge policy.
* The ram dir size is pushed to DocumentsWriter.
* RAMMergePolicy extends LogDocMergePolicy and defaults useCompoundFile and useCompoundDocStore to false.
* Sorry for the whitespace stuff, I'll clean it up later; I wanted to post the latest to get feedback.
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* In DocumentsWriter.balanceRAM, if NRT is on, the total ram consumed is "(numBytesUsed * 2) + writer.getRamDirSize()". numBytesUsed is the current consumption of the ram buffer. Basically, what we flush to ram consumes that much again, so it is counted twice. This is now taken into account in the bufferIsFull calculation.
* Double dir usage should be factored out.
* TestIndexWriterRamDir.testFSDirectory fails. It tries to simulate a crashing IW. When the IW is created again it should delete the old files, but for some reason it does not with FSDirectory (open file handles on Windows, perhaps).

{quote}
we could flush the new segment directly to the real dir as one segment, and merge all prior RAM segments as a separate new segment in the main dir, if the free RAM is large enough.
{quote}

Yeah, it's unclear what the best policy is here. Do we want to have some sort of custom merge policy method/class to take care of this so the user can customize it?
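The balanceRAM accounting above can be written out as a small worked example. This is only a sketch: the expression (numBytesUsed * 2) + ramDirSize is taken from the comment, while the class name NrtBufferCheck, the ramBufferLimit parameter, and the use of >= for the comparison are assumptions.

```java
/** Hypothetical sketch of the NRT buffer-full check; not the patch's DocumentsWriter code. */
final class NrtBufferCheck {
    private NrtBufferCheck() {}

    /** Total RAM consumed per the comment: (numBytesUsed * 2) + ramDirSize. */
    static long totalRamConsumed(long numBytesUsed, long ramDirSize) {
        // The buffer is counted twice because flushing it writes a copy into the ramdir.
        return numBytesUsed * 2 + ramDirSize;
    }

    /** Assumed comparison: the buffer counts as full once consumption reaches the limit. */
    static boolean bufferIsFull(long numBytesUsed, long ramDirSize, long ramBufferLimit) {
        return totalRamConsumed(numBytesUsed, ramDirSize) >= ramBufferLimit;
    }
}
```

For example, with a 10 MB buffer and a 5 MB ramdir the consumed total is 10 * 2 + 5 = 25 MB, so a 32 MB limit is not yet reached; at 14 MB of buffer it is 33 MB and the buffer counts as full.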
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* IndexFileDeleter now takes the ram directory into account (previously, using NRT with FileSwitchDirectory (FSD) caused files to not be found).
* FSD is included and writes fdx, fdt, tvx, tvf, tvd extension files to the primary directory (which is the same as IW.directory). LUCENE-1618 needs to be updated with these changes (or we simply include it in this patch, as the LUCENE-1618 patch is only a couple of files).
* Removed DocumentsWriter.ramOverLimit.
* I think we need to give the option of a ram mergescheduler because the user may not want the ram merging and disk merging to compete for threads. I'm thinking of the use case where NRT is a priority; then one may allocate more threads to the ram CMS and fewer to the disk CMS. This also gives us the option of trying out more parameters when performing benchmarks of NRT.
* We may want to default the ram mergepolicy to not use compound files, as they're not useful when using a ram dir?
* Because FSD uses IW.directory, FSD will list files that originated from FSD and from IW.directory. We may want to keep track of which files are supposed to be in FSD (from the underlying primary dir) and which are not?

{quote}If NRT is never used, the behavior of IW should be unchanged (which is not the case w/ this patch I think). RAMDir should be created the first time a flush is done due to NRT creation.{quote}

In the patch, if ramdir is not passed in, the behavior of IW remains the same as it is today. You're saying we should have IW create the ramdir by default after getReader is called and remove the IW ramdir constructor? What if the user has an alternative ramdir implementation they want to use?

{quote}StoredFieldsWriter & TermVectorsTermsWriter now write to IndexWriter.getFlushDirectory(), which is confusing because that method returns the RAMDir if set? Shouldn't this be the opposite? (Ie it should flush to IndexWriter.getDirectory()? Or we should change getFlushDirectory to NOT return the ramdir?){quote}

The attached patch uses FileSwitchDirectory, where these files are written to the primary directory (IW.directory). So getFlushDirectory is ok?

{quote}Why did you need to add synchronized to some of the SegmentInfo files methods? (What breaks if you undo that?). The contract here is IW protects access to SegmentInfo/s{quote}

SegmentInfo.files was being cleared while sizeInBytes was called, which resulted in an NPE. The alternative is to sync on IW in IW.size(SegmentInfos), which seems a bit extreme just to obtain the size of a segment info?

{quote}The MergePolicy needs some smarts when it's dealing w/ RAM. EG it should not do a merge of more than XXX% of total RAM usage (should flush to the real directory instead){quote}

Isn't this handled well enough in updatePendingMerges, or is there more that needs to be done?
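The extension-based routing described above (doc store files go to the primary directory, everything else to the ram dir) can be sketched without any Lucene classes. This is an illustration of the idea only, not FileSwitchDirectory itself: the class DocStoreRouter, the Target enum, and both method names are hypothetical. The five extensions are the ones listed in the comment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of routing index files by extension; not FileSwitchDirectory. */
final class DocStoreRouter {
    enum Target { PRIMARY_DIR, RAM_DIR }

    // Stored fields (fdx, fdt) and term vectors (tvx, tvf, tvd), as listed in the comment.
    private static final Set<String> DOC_STORE_EXTENSIONS =
        new HashSet<String>(Arrays.asList("fdx", "fdt", "tvx", "tvf", "tvd"));

    private DocStoreRouter() {}

    /** Returns the text after the last '.', or an empty string when there is no dot. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    /** Doc store files are routed to the primary (disk) directory, all others to the ram dir. */
    static Target targetFor(String fileName) {
        return DOC_STORE_EXTENSIONS.contains(extensionOf(fileName))
            ? Target.PRIMARY_DIR
            : Target.RAM_DIR;
    }
}
```

Under this model a name such as _0.fdt resolves to the primary directory while _0.frq resolves to the ram dir. That split is also why listing files gets confusing: the primary directory holds both routed doc store files and files IW wrote there directly.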
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

Fixed and cleaned up more.
All tests pass.
Added entry in CHANGES.txt.

I'm going to integrate LUCENE-1618 and test that out as a part of the next patch.
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* Ok, fixed the ensureContiguousMerge exception by asking the mergePolicy (not ramMergePolicy) to evaluate the ram segment infos as an optimize to directory. Now all the current tests pass.
* The patch is cleaned up a little; it needs more cleanup and further test cases.
* IndexWriter doesn't accept setRAMDirectory anymore; the ram dir needs to be passed into the IndexWriter constructor. This is because the system can't keep running if the ram dir is changed in the middle of an operation.
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

{quote} Would you re-use MergePolicy, or make a new RAMMergePolicy? {quote}

MergePolicy is used as is, with a special IW method that handles merging ram segments to the real directory (which has an issue around the requirement that merged segments be contiguous; can that be relaxed in this case? I don't understand why it exists).

The patch is not committable; however, I am posting it to show a path that seems to work. It includes test cases for merging in ram and merging to the real directory.

* IW.getFlushDirectory is used by internal calls to obtain the directory to flush segments to. This is used in DocumentsWriter related calls.
* DocumentsWriter.directory is removed so that methods requiring the directory call IW.getFlushDirectory instead.
* IW.setRAMDirectory sets the ram directory to be used.
* IW.setRAMMergePolicy sets the merge policy to be used for merging segments on the ram dir.
* In IW.updatePendingMerges, totalRamUsed is the size of the ram segments + the ram buffer used. If totalRamUsed exceeds the max ram buffer size then IW.updatePendingRamMergesToRealDir is called.
* IW.updatePendingRamMergesToRealDir registers a merge of the ram segments to the real directory (currently causes a non-contiguous segments exception).
* MergePolicy.OneMerge has a directory attribute used when building the merge.info in _mergeInit.
* Test case includes testMergeInRam, testMergeToDisk, testMergeRamExceeded.

There is one error that occurs regularly in testMergeRamExceeded:

{code}
MergePolicy selected non-contiguous segments to merge (_bo:cx83 _bm:cx4 _bn:cx2 _bl:cx1->_bj _bp:cx1->_bp _bq:cx1->_bp _c2:cx1->_c2 _c3:cx1->_c2 _c4:cx1->_c2 vs _5x:c120 _6a:c8 _6t:c11 _bo:cx83** _bm:cx4** _bn:cx2** _bl:cx1->_bj** _bp:cx1->_bp** _bq:cx1->_bp** _c1:c10 _c2:cx1->_c2** _c3:cx1->_c2** _c4:cx1->_c2**), which IndexWriter (currently) cannot handle
{code}
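The error above is easier to read with the check spelled out. The sketch below is a simplified stand-in for the kind of test IndexWriter applies, written against plain segment names rather than SegmentInfo objects; the class ContiguityCheck and its method are hypothetical. In the reported case, the ram segments selected for the merge are interrupted by the disk segment _c1, so they do not form one unbroken run in the writer's segment list.

```java
import java.util.List;

/** Hypothetical, simplified contiguity check over segment names; not IndexWriter's code. */
final class ContiguityCheck {
    private ContiguityCheck() {}

    /** True if 'selected' appears inside 'all' as one unbroken run, in the same order. */
    static boolean isContiguous(List<String> all, List<String> selected) {
        if (selected.isEmpty()) {
            return true;
        }
        int first = all.indexOf(selected.get(0));
        if (first < 0 || first + selected.size() > all.size()) {
            return false; // first segment unknown, or the run would fall off the end
        }
        for (int i = 0; i < selected.size(); i++) {
            if (!all.get(first + i).equals(selected.get(i))) {
                return false; // some other segment sits inside the run
            }
        }
        return true;
    }
}
```

Applied to the names in the exception, the run _bo, _bm, _bn, _bl, _bp, _bq is contiguous on its own, but adding _c2, _c3, _c4 fails because _c1 lies between _bq and _c2.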
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

* The RAMIndex deletes approach changed to be like IndexWriter's. The deletes are queued in lists, then applied on RI.flush.
* There is redundancy between IW.delete* and RI.delete*; perhaps we don't need RI.delete*?
* We need more multithreaded tests, probably based on TestIndexWriter, to see if we can trigger issues with deletes that occur while RI is calling IW.addIndexesNoOptimize.
* If RI.delete* is removed, do we need a separate RAMIndex class to add documents to, or is there a more transparent way for the NRT ramdir to work? Perhaps we can add an IW.flushToRamDir method (whereas IW.flush writes to the IW directory) that flushes the rambuffer to the RAMIndex? Some of the issues are around swapping out the RAMDir once its segments are flushed to IW. If we took this approach, would we need an IW.getReaderRAM method that flushes to the ramdir instead of flushing to disk? The other problem with the IW.flushToRamDir system is the loss of concurrency: a large rambuffer may be flushing to disk while the user really wants small, incremental, RI-based NRT updates at the same time.
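The "queue deletes in lists, apply them on flush" approach mentioned above can be illustrated with a toy model. Nothing here is RAMIndex or IndexWriter code: BufferedDeletesSketch, its string ids, and its method names are hypothetical, and the model is deliberately single-threaded, so it says nothing about the concurrency questions raised in the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of buffered deletes; not RAMIndex or IndexWriter code. */
final class BufferedDeletesSketch {
    private final Map<String, String> docsById = new LinkedHashMap<String, String>();
    private final List<String> pendingDeletes = new ArrayList<String>();

    void addDocument(String id, String body) {
        docsById.put(id, body);
    }

    /** Only queues the delete; the document stays visible until flush(). */
    void deleteDocument(String id) {
        pendingDeletes.add(id);
    }

    /** Applies queued deletes in order, clears the queue, and returns how many docs were removed. */
    int flush() {
        int removed = 0;
        for (String id : pendingDeletes) {
            if (docsById.containsKey(id)) {
                docsById.remove(id);
                removed++;
            }
        }
        pendingDeletes.clear();
        return removed;
    }

    int visibleDocCount() {
        return docsById.size();
    }
}
```

The point of the design is that a delete costs only a list append at call time, and the actual removal is batched with the flush.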
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.patch

I added an IndexWriter.getRAMIndex method that returns a RAMIndex object that can be updated and flushed to the underlying writer. I think this is better than adding more methods to IndexWriter, and it separates the logic of the RAM based near realtime index from the rest of IW.

Package protected IW.addIndexesNoOptimize(DirectoryIndexReader[] readers) is added, which is used by RAMIndex.flush. I thought this functionality could work for LUCENE-1589 as a public method; however, because of the way IndexWriter performs merges using segment infos, handling generic IndexReader classes (which may not use segmentinfos) would then be difficult in the addIndexesNoOptimize case.

I think RAMIndex.flush to the underlying writer is not synchronized. If the IW is using ConcurrentMergeScheduler then the heavy lifting is performed in the background and so should not delay adding more documents to the RAMIndex.

IW.getReader returns the normal IW reader and the RAMIndex reader if there is one.

The RAMIndex writer can be obtained and modified directly, as opposed to duplicating the setter methods of IndexWriter such as setMergeScheduler.
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-
    Attachment: LUCENE-1313.jar

Latest realtime code; transactions are removed.

* Needs to be benchmarked.
* There could be concurrency issues around deletes that occur while directories are being flushed to disk.
* It's packaged as a Java JAR to include the files and directory structure. The patch relies on LUCENE-1516, which, if included, would make the changes incomprehensible.
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
---

    Attachment: LUCENE-1313.patch

The patch includes RealtimeIndex, a basic class for performing atomic, transactional realtime indexing and search. A single thread periodically flushes the RAM index to disk. It relies on LUCENE-1516.

We need to benchmark this, specifically:
1) realtime w/ramdir transaction
2) realtime w/queued documents transaction
3) normal indexing

Realtime w/ramdir encodes the transaction to a RAMDirectory, which is added to the RAM writer using IW.addIndexesNoOptimize. Option 1 may be slower than option 2; however, if the system is replicating, it may be the only option?

Long term I believe we need to implement searching over the IndexWriter RAM buffer (if possible). However, I am not sure how option 2 would work with it.

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.
> Possible future directions:
> * Optimistic concurrency
> * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory
> enables replication. It is difficult to replicate using other methods
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and
> searching concurrently.
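The behaviour described in the comment above (transactions land in a RAM index, stay searchable, and a single thread periodically flushes them to disk) can be modelled roughly as follows. This is only a toy sketch under loud assumptions: plain Java lists stand in for the RAM and disk indexes, and ToyRealtimeIndex and its methods are hypothetical names, not the RealtimeIndex class from the patch or any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy model only: plain lists stand in for the RAM index and the on-disk index. */
final class ToyRealtimeIndex {
    private final List<String> ram = new ArrayList<>();   // stand-in for the RAM index
    private final List<String> disk = new ArrayList<>();  // stand-in for the disk index

    /** Adds a whole batch under one lock, so a search sees all of it or none of it. */
    synchronized void commitTransaction(List<String> docs) {
        ram.addAll(docs);
    }

    /** Moves everything buffered in RAM to "disk"; intended to run periodically. */
    synchronized void flush() {
        disk.addAll(ram);
        ram.clear();
    }

    /** Counts matches across both tiers, so unflushed documents are searchable. */
    synchronized int count(String term) {
        int n = 0;
        for (String d : disk) if (d.contains(term)) n++;
        for (String d : ram) if (d.contains(term)) n++;
        return n;
    }

    /** Number of documents still buffered in RAM. */
    synchronized int ramSize() {
        return ram.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ToyRealtimeIndex index = new ToyRealtimeIndex();
        // One background thread performs the periodic flush (interval is arbitrary).
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
        flusher.scheduleWithFixedDelay(index::flush, 50, 50, TimeUnit.MILLISECONDS);

        index.commitTransaction(Arrays.asList("hello world", "hello lucene"));
        // The count is the same whether or not a flush has already happened.
        System.out.println(index.count("hello"));

        flusher.shutdown();
        flusher.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

In this model a flush never changes what a search returns, only where the documents live; that is the property a benchmark of the three modes listed above would presumably want to hold constant while measuring throughput.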
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
---

    Component/s: (was: contrib/*)
                 Index
    Fix Version/s: 2.9
    Priority: Minor (was: Major)
    Description:
Realtime search with transactional semantics.

Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.

I think this issue can hold realtime benchmarks which include indexing and searching concurrently.

    was:
Provides realtime search using Lucene. Conceptually, updates are divided into discrete transactions. The transaction is recorded to a transaction log which is similar to the mysql bin log. Deletes from the transaction are made to the existing indexes. Document additions are made to an in memory InstantiatedIndex. The transaction is then complete. After each transaction TransactionSystem.getSearcher() may be called which allows searching over the index including the latest transaction.

TransactionSystem is the main class. Methods similar to IndexWriter are provided for updating. getSearcher returns a Searcher class.
- getSearcher()
- addDocument(Document document)
- addDocument(Document document, Analyzer analyzer)
- updateDocument(Term term, Document document)
- updateDocument(Term term, Document document, Analyzer analyzer)
- deleteDocument(Term term)
- deleteDocument(Query query)
- commitTransaction(List documents, Analyzer analyzer, List deleteByTerms, List deleteByQueries)

Sample code:
{code}
// setup
FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), "log");
LogDirectory logDirectory = directoryMap.getLogDirectory();
TransactionLog transactionLog = new TransactionLog(logDirectory);
TransactionSystem system = new TransactionSystem(transactionLog, new SimpleAnalyzer(), directoryMap);

// transaction
Document d = new Document();
d.add(new Field("contents", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
system.addDocument(d);

// search (any Lucene Query works here; a simple term query is assumed for illustration)
Query query = new TermQuery(new Term("contents", "hello"));
OceanSearcher searcher = system.getSearcher();
ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length + " total results");
for (int i = 0; i < hits.length && i < 10; i++) {
  // load the stored fields of the matching document
  Document hit = searcher.doc(hits[i].doc);
  System.out.println(i + " " + hits[i].score + " " + hit.get("contents"));
}
{code}

There is a test class org.apache.lucene.ocean.TestSearch that was used for basic testing.

A sample disk directory structure is as follows:
|/snapshot_105_00.xml | XML file recording which indexes and generation numbers correspond to a snapshot. Each transaction creates a new snapshot file. In this file name, 105 is the snapshotid, also known as the transactionid. The 00 is the minor version of the snapshot, corresponding to a merge. A merge is a minor snapshot version because the data does not change, only the underlying structure of the index|
|/3 | Directory containing an on-disk Lucene index|
|/log | Directory containing log files|
|/log/log0001.bin | Log file. As new log files are created the suffix number is incremented|

    Affects Version/s: 2.4.1
    Summary: Realtime Search (was: Ocean Realtime Search)

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.
> Possible future directions:
> * Optimistic concurrency
> * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory
> enables replication. It is difficult to replicate using other methods
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and
> searching concurrently.
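The snapshot file naming in the earlier description above (snapshot_105_00.xml, where 105 is the snapshot or transaction id and 00 is the minor version bumped by a merge) could be parsed along these lines. This is a sketch under assumptions: SnapshotFileName is a hypothetical helper, not a class from the Ocean patch, and the two-digit zero padding of the minor version is inferred from the single example file name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the Ocean patch. */
final class SnapshotFileName {
    // Assumed layout: snapshot_<snapshotId>_<minorVersion>.xml
    private static final Pattern NAME = Pattern.compile("snapshot_(\\d+)_(\\d+)\\.xml");

    final long snapshotId;   // also described as the transaction id
    final int minorVersion;  // changes on a merge; the indexed data stays the same

    private SnapshotFileName(long snapshotId, int minorVersion) {
        this.snapshotId = snapshotId;
        this.minorVersion = minorVersion;
    }

    /** Parses a bare file name (no leading directory) into its two numeric parts. */
    static SnapshotFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a snapshot file: " + fileName);
        }
        return new SnapshotFileName(Long.parseLong(m.group(1)), Integer.parseInt(m.group(2)));
    }

    /** Formats back to a file name, zero-padding the minor version to two digits (assumed). */
    String toFileName() {
        return String.format("snapshot_%d_%02d.xml", snapshotId, minorVersion);
    }

    public static void main(String[] args) {
        SnapshotFileName s = parse("snapshot_105_00.xml");
        System.out.println(s.snapshotId + " " + s.minorVersion);
    }
}
```

Keeping the id and the minor version as separate fields mirrors the distinction the description draws: a new transaction produces a new snapshot id, while a merge only changes the minor version.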