[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915685#action_12915685 ]
Michael McCandless commented on LUCENE-2575:
--------------------------------------------

bq. A copy of the byte[][] refs is made when getReader is called.

Hmm, why can't the reader just use the current byte[][]? The writer only adds new blocks to this array (it doesn't overwrite already written blocks until flush), and then allocates a new byte[][] once that array is full?

{quote}
I think the issue at the moment is I'm using a boolean[] to signify if a byte[] needs to be copied before being written to
{quote}

Hmm, so we also copy-on-write a given byte[] block? Is this because the JMM can't make the guarantees we need about other threads reading the bytes written? (A rough sketch of this copy-on-write idea is appended after the issue details below.)

{quote}
I have a suspicion we'll change our minds about pooling byte[]s. We may end up implementing ref counting anyways (as described above), and the sudden garbage generated could be a massive change for users?
{quote}

But even if we do reuse, we will cause tons of garbage until the still-open readers are closed? Ie we cannot re-use the byte[]s being "held open" by any NRT reader that's still referencing the in-RAM segment after that segment has been flushed to disk.

Also, the garbage shouldn't be that bad, since each object is large. It's not like 3.x's situation with FieldCache or the terms dict index, for example....

I would start simple by dropping reuse. We can then add it back if we see perf issues?

{quote}
Both very common types of queries, so we probably need some type of skipping, which we will, it'll just be single-level.
{quote}

I would start simple here and make skipping stupid, ie just scan. You can get everything working, all tests passing, etc., and then adding skipping is a much more isolated change. You need all the isolation you can get here! This stuff is *hairy*.

{quote}
As a side note, there is still an issue in my mind around the term frequencies parallel array (introduced in these patches), in that we'd need to make a copy of it for each reader (because if it changes, the scoring model becomes inaccurate?).
{quote}

Hmm, you're right that each reader needs a private copy, to remain truly "point in time". This (4 bytes per unique term X number of readers reading that term) is a non-trivial addition of RAM.

BTW I'm assuming IW will now be modal? Ie the caller must tell IW up front whether NRT readers will be used? Because non-NRT users shouldn't have to pay all this added RAM cost?

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent. We really need something that has a locking flush method, where flush is called at the end of adding a document. Once flushed, the newly written data would be available to all other reading threads (ie, postings etc). I'm not sure I understand the slices concept; it seems like it'd be easier to implement a seekable, random-access-file-like API. One would seek to a given position, then read or write from there. The underlying management of byte arrays could then be hidden?
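For illustration only (not from the attached patches), here is a minimal sketch of the copy-on-write block pool idea discussed above: readers snapshot the current byte[][], and the writer copies a block before mutating it if any snapshot may still reference it. All class and method names here are hypothetical.

{code:java}
/**
 * Hypothetical sketch of a copy-on-write byte block pool: getReaderSnapshot()
 * marks the current blocks as shared, and the writer copies a shared block
 * before writing into it, so reader snapshots stay point-in-time.
 */
public class CopyOnWriteBlockPool {
  public static final int BLOCK_SIZE = 32768;

  private byte[][] blocks = new byte[16][];
  private boolean[] shared = new boolean[16];   // true if a reader snapshot may hold this block
  private int blockUpto = -1;                   // index of the block currently being written

  /** Returns a point-in-time view of the written blocks. */
  public synchronized byte[][] getReaderSnapshot() {
    // Mark all current blocks as shared so future writes copy first.
    for (int i = 0; i <= blockUpto; i++) {
      shared[i] = true;
    }
    byte[][] snapshot = new byte[blockUpto + 1][];
    System.arraycopy(blocks, 0, snapshot, 0, blockUpto + 1);
    return snapshot;
  }

  /** Returns a block safe to write into, copying it first if a reader may see it. */
  public synchronized byte[] writableBlock(int index) {
    if (shared[index]) {
      // A reader snapshot still points at the original block; copy before writing
      // so the reader keeps seeing the bytes as they were at snapshot time.
      blocks[index] = java.util.Arrays.copyOf(blocks[index], BLOCK_SIZE);
      shared[index] = false;
    }
    return blocks[index];
  }

  /** Allocates a new block once the current one fills up. */
  public synchronized byte[] newBlock() {
    blockUpto++;
    if (blockUpto == blocks.length) {
      blocks = java.util.Arrays.copyOf(blocks, blocks.length * 2);
      shared = java.util.Arrays.copyOf(shared, shared.length * 2);
    }
    blocks[blockUpto] = new byte[BLOCK_SIZE];
    return blocks[blockUpto];
  }
}
{code}

A reader built from getReaderSnapshot() keeps referencing the original blocks even after the writer has copied them and continued appending, which is the "point in time" behavior discussed above; the cost is the extra garbage generated per copied block.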