[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841120#action_12841120 ]

Michael Busch commented on LUCENE-2293:
---

{quote}
Also, in the pull approach, Lucene would introduce another place where it allocates threads.
{quote}

What I described is not much different from what's happening today. DocumentsWriter already has a WaitQueue that ensures the docs are written in the right order. I simply tried to suggest a way to refactor our classes... functionally the same as what Mike suggested. I shouldn't have said "pulled from" (the queue).

IndexWriter has hard limit on max concurrency
---
Key: LUCENE-2293
URL: https://issues.apache.org/jira/browse/LUCENE-2293
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1

DocumentsWriter has this nasty hardwired constant:

{code}
private final static int MAX_THREAD_STATE = 5;
{code}

which I probably should have attached a //nocommit to the moment I wrote it ;)

That constant sets the max number of thread states to 5. This means, if more than 5 threads enter IndexWriter at once, they will share only 5 thread states, meaning we gate CPU concurrency to 5 running threads inside IW (each thread must first wait for the last thread to finish using the thread state before grabbing it).

This is bad because modern hardware can make use of more than 5 threads. So I think an immediate fix is to make this settable (expert), and increase the default (8?).

It's tricky, though, because the more thread states, the less RAM efficiency you have, meaning the worse indexing throughput. So you shouldn't up and set this to 50: you'll be flushing too often.

But... I think a better fix is to re-think how threads write state into DocumentsWriter. Today, a single docID stream is assigned across threads (eg one thread gets docID=0, next one docID=1, etc.), and each thread writes to a private RAM buffer (living in the thread state), and then on flush we do a merge sort. The merge sort is inefficient (does not currently use a PQ)... and wasteful, because we must re-decode every posting byte.

I think we could change this, so that threads write to private RAM buffers, with a private docID stream, but then instead of merging on flush, we directly flush each thread as its own segment (and allocate private docIDs to each thread). We can then leave merging to CMS, which can already run merges in the BG without blocking ongoing indexing (unlike the merge we do in flush, today).

This would also allow us to separately flush thread states. Ie, we need not flush all thread states at once -- we can flush one when it gets too big, and then let the others keep running. This should be a good concurrency gain since it uses IO and CPU resources throughout indexing, instead of the big burst of CPU only followed by a big burst of IO only that we have today (flush today stops the world).

One downside I can think of is... docIDs would now be less monotonic. Today, if N threads are indexing, you'll roughly get in-time-order assignment of docIDs. But with this change, all of one thread state would get 0..N docIDs, the next thread state'd get N+1...M docIDs, etc. However, a single thread would still get monotonic assignment of docIDs.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
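To make the gating effect in the LUCENE-2293 description concrete, here is a minimal sketch of a fixed-size pool of thread states with a settable maximum. The class names `ThreadStateSketch` and `ThreadStatePoolSketch` and the `tryAcquire`/`release` methods are hypothetical and are not DocumentsWriter's actual code; the real implementation makes threads wait, whereas this sketch simply returns null when every state is in use.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for a per-thread indexing buffer; not Lucene's class. */
final class ThreadStateSketch {
    final int id;

    ThreadStateSketch(int id) {
        this.id = id;
    }
}

/** Hypothetical pool: at most maxThreadStates threads can hold a state at once. */
final class ThreadStatePoolSketch {
    private final BlockingQueue<ThreadStateSketch> free;

    ThreadStatePoolSketch(int maxThreadStates) {
        if (maxThreadStates < 1) {
            throw new IllegalArgumentException("maxThreadStates must be >= 1");
        }
        free = new ArrayBlockingQueue<>(maxThreadStates);
        for (int i = 0; i < maxThreadStates; i++) {
            free.add(new ThreadStateSketch(i));
        }
    }

    /** Returns a free thread state, or null if all states are currently in use. */
    ThreadStateSketch tryAcquire() {
        return free.poll();
    }

    /** Hands a thread state back so that another thread can use it. */
    void release(ThreadStateSketch state) {
        free.add(state);
    }
}
```

With a pool of size 5, a sixth concurrent caller gets nothing until some other thread calls `release`, which is the concurrency cap the issue complains about.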
RE: Lucene Filter
yaa... and now I am trying with multiple filters. Thanks

-- View this message in context: http://old.nabble.com/Lucene-Filter-tp27756577p27778081.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841127#action_12841127 ]

Shai Erera commented on LUCENE-2293:
---

bq. What I described is not much different from what's happening today.

Maybe I didn't understand then:

{quote}
basically a load balancer, that multiple DocumentsWriter instances would pull from as soon as they are done inverting the previous document?
{quote}

Who adds documents to that queue and what are the DW instances? The way I read it, I understood those are different threads than the application threads. If I misunderstood that, could you please clarify?

Also, I thought that having each thread write to a different ThreadState does not ensure documents are written in order, and that only when DW finally flushes, merging the different ThreadStates into one segment, is the order somehow restored ... If only WaitQueue was documented :).

I obviously don't know that part of the code as well as you. So if I misunderstood your meaning, I'd appreciate it if you clarified it for me. What I would like to avoid is having Lucene allocate indexing threads on its own.

Also, is my proposal above different from what you suggest?
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841135#action_12841135 ]

Michael Busch commented on LUCENE-2293:
---

Sorry - after reading my comment again I can see why it was confusing. Load balancer wasn't a very good analogy.

I totally agree that Lucene should still piggyback on the application's threads and not start its own threads for document inversion.

Today, as you said, the DocumentsWriter manages a certain number of thread states, has the WaitQueue, and does its own memory management.

What I was thinking was that it would be simpler if the DocumentsWriter was only used by a single thread. The IndexWriter would have multiple DocumentsWriters and do the thread binding (+waitqueue). This would make the code in DocumentsWriter and the downstream classes simpler. The side effect is that each DocumentsWriter would manage its own memory.

{quote}
Also, I thought that each thread writes to different ThreadState does not ensure documents are written in order, but that finally when DW flushes, the different ThreadStates are merged together and one segment is written, somehow restores the orderness ...
{quote}

Stored fields are written to an on-disk stream (docstore) in order. The WaitQueue takes care of finishing the docs in the right order.

The postings are written into TermHashes per thread state in parallel. The doc ids are in increasing order, but can have gaps. E.g. thread state 1 inverts docs 1 and 3, thread state 2 inverts doc 2. When it's time to flush the whole buffer, these different TermHash postings lists get interleaved.
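The interleaving that Michael Busch describes (thread state 1 holds docs 1 and 3, thread state 2 holds doc 2) can be illustrated with a priority-queue based k-way merge, the data structure the issue description says the current merge sort does not use. This is only a sketch under simplifying assumptions: the class `PostingsInterleaveSketch` is hypothetical, and it merges bare docID arrays rather than Lucene's encoded postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of merging per-thread-state docID lists on flush. */
final class PostingsInterleaveSketch {

    /** Merges docID lists, each already ascending, into one ascending list. */
    static List<Integer> interleave(List<int[]> perThreadDocIds) {
        // Queue entries are {docId, listIndex, positionInList}, ordered by docId.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < perThreadDocIds.size(); i++) {
            int[] ids = perThreadDocIds.get(i);
            if (ids.length > 0) {
                pq.add(new int[] {ids[0], i, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            merged.add(top[0]);
            // Advance within the list the smallest docID came from.
            int[] source = perThreadDocIds.get(top[1]);
            int next = top[2] + 1;
            if (next < source.length) {
                pq.add(new int[] {source[next], top[1], next});
            }
        }
        return merged;
    }
}
```

Each element is pushed and popped once, so the merge costs O(n log k) for n postings spread over k thread states. Flushing each thread state as its own segment, as proposed in the issue, would skip this step at flush time entirely.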
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841140#action_12841140 ]

Shai Erera commented on LUCENE-2293:
---

Ok so I think I understand now. You propose to change IW to bind a Thread to a DW, instead of that being done inside DW. And therefore it will simplify DW's code ... I wonder if that won't complicate IW's code in return? Perhaps we'll gain a lot of simplification in DW, so a bit of complexity in IW will be ok.

If we do that .. why not rename DW to SegmentWriter? If each DW will eventually flush its own Segment, the name would make more sense?

BTW, I was thinking that an application can emulate this sort of thing even today (well ... to some extent - w/o deletes). It can create an IW for each indexing thread and at the end call addIndexes. What we'd need to introduce on IW to make it efficient though is something like addRawIndexes, which will just update the segments file about the new segments, but won't attempt to merge them and clean deletes out of them. I think I want this API anyway for being able to add segments faster to an index, if e.g. you don't care about the merges at the moment ... but that is a separate issue.

Then I think what I proposed is more or less the same as what you propose, so I'm fine with that approach. When a DW/SW realizes it has exhausted its memory pool, it just flushes, and new threads will bind to other DW/SWs.

Thanks for the explanation on WaitQueue.
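The approach the thread converges on (bind each application thread to its own writer, and let each writer flush its own segment when its buffer fills) can be sketched roughly as follows. Everything here is an assumption made for illustration: `PerThreadSegmentSketch` is a hypothetical class, documents are plain strings, the flush trigger is a document count rather than a RAM budget, and a "segment" is just a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: per-thread buffers that flush independently. */
final class PerThreadSegmentSketch {
    private final int maxBufferedDocs;
    private final List<List<String>> flushedSegments = new ArrayList<>();
    // Each calling thread is bound to its own private buffer.
    private final ThreadLocal<List<String>> buffer =
        ThreadLocal.withInitial(ArrayList::new);

    PerThreadSegmentSketch(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers a document on the calling thread; flushes only that thread's buffer. */
    void addDocument(String doc) {
        List<String> mine = buffer.get();
        mine.add(doc);
        if (mine.size() >= maxBufferedDocs) {
            flushCurrentThread();
        }
    }

    /** Writes the calling thread's buffer out as its own segment. */
    void flushCurrentThread() {
        List<String> mine = buffer.get();
        if (mine.isEmpty()) {
            return;
        }
        List<String> segment = new ArrayList<>(mine);
        mine.clear();
        // Only the segment list is shared, so only this step needs a lock.
        synchronized (flushedSegments) {
            flushedSegments.add(segment);
        }
    }

    /** Returns a snapshot of the segments flushed so far. */
    List<List<String>> segments() {
        synchronized (flushedSegments) {
            return new ArrayList<>(flushedSegments);
        }
    }
}
```

Because a full buffer never touches another thread's buffer, other threads keep indexing while one flushes, and within each segment a single thread's documents stay in the order it added them.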
[jira] Created: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
Create IndexWriterConfiguration and store all of IW configuration there
---
Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Fix For: 3.1

I would like to factor all IW configuration parameters out into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration).

What I was thinking of storing there are the following parameters:
* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything through setter/getter methods, and default to whatever IW defaults to today. Except for Analyzer, which will need to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that anyone who chooses to move to the new API is aware of it ...

I plan to deprecate all the ctors and getters/setters and replace them by:
* One ctor as described above
* getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API.

I'll start to work on a patch.
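The chaining idea in the LUCENE-2294 proposal can be sketched with setters that return the configuration object itself. The class `WriterConfigSketch`, its fields, and its default values are hypothetical and chosen only for illustration; the analyzer is represented by a plain string and fixed in the constructor, mirroring the proposal that Analyzer has no setter.

```java
/** Hypothetical configuration holder; names and defaults are not Lucene's API. */
final class WriterConfigSketch {
    /** Sentinel meaning "do not truncate fields". */
    static final int UNLIMITED = Integer.MAX_VALUE;

    // Fixed at construction time, so there is deliberately no setter for it.
    private final String analyzerName;
    private int maxFieldLength = UNLIMITED;
    private double ramBufferSizeMB = 16.0; // illustrative default only

    WriterConfigSketch(String analyzerName) {
        if (analyzerName == null) {
            throw new IllegalArgumentException("analyzerName must not be null");
        }
        this.analyzerName = analyzerName;
    }

    /** Returns this so that set calls can be chained. */
    WriterConfigSketch setMaxFieldLength(int maxFieldLength) {
        if (maxFieldLength < 1) {
            throw new IllegalArgumentException("maxFieldLength must be >= 1");
        }
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    /** Returns this so that set calls can be chained. */
    WriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
        if (ramBufferSizeMB <= 0.0) {
            throw new IllegalArgumentException("ramBufferSizeMB must be > 0");
        }
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() {
        return analyzerName;
    }

    int getMaxFieldLength() {
        return maxFieldLength;
    }

    double getRamBufferSizeMB() {
        return ramBufferSizeMB;
    }
}
```

A caller could then write `new WriterConfigSketch("standard").setMaxFieldLength(10000).setRamBufferSizeMB(32.0)` and pass the result to a single constructor, which is the shape the proposal aims for.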
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841152#action_12841152 ]

Uwe Schindler commented on LUCENE-2294:
---

+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailing list: [http://www.lucidimagination.com/search/document/d32100d8a7b67366/lucene_2_9_and_deprecated_ir_open_methods#19e1a19f4d340b8c]
[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841152#action_12841152 ]

Uwe Schindler edited comment on LUCENE-2294 at 3/4/10 10:00 AM:
---

+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailing list: [http://www.lucidimagination.com/search/document/19e1a19f4d340b8c/lucene_2_9_and_deprecated_ir_open_methods]

was (Author: thetaphi):
+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailing list: [http://www.lucidimagination.com/search/document/d32100d8a7b67366/lucene_2_9_and_deprecated_ir_open_methods#19e1a19f4d340b8c]
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841166#action_12841166 ]

Shai Erera commented on LUCENE-2294:
---

Thanks Uwe for the pointer. I suppose that MaxFieldLength should now move to IndexWriterConfig, only it is public and therefore needs to be deprecated. Otherwise it will look strange that in order to set MFL on IWC you need to reference IW. So deprecate and duplicate? To IW it doesn't matter, because it just takes the limit (int) from MFL ...
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841172#action_12841172 ]

Uwe Schindler commented on LUCENE-2294:
---

In my opinion the whole class is unneeded, so only deprecate it in IW and do not add it to IWConfig. For me a constant in IWConfig that defines UNLIMITED would be enough, and everything else is just an integer.

+1 for defaulting to static final UNLIMITED=Integer.MAX_VALUE. I am not sure why this limitation is there at all. In my opinion it should be left to the application to limit the number of tokens if needed, rather than silently dropping tokens. If somebody gets an OOM, he can adjust the value and knows that maybe some tokens get lost.
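To show what Uwe's suggestion amounts to (a single UNLIMITED constant, with every other limit being a plain int), here is a small sketch of applying such a limit to a field's tokens. The class `FieldLengthLimitSketch` and its `limitTokens` method are hypothetical and operate on a list of strings, not on Lucene's token streams.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of an int-based max field length. */
final class FieldLengthLimitSketch {
    /** With this value no token is ever dropped in practice. */
    static final int UNLIMITED = Integer.MAX_VALUE;

    /** Keeps at most maxFieldLength tokens; any tokens beyond that are dropped. */
    static List<String> limitTokens(List<String> tokens, int maxFieldLength) {
        if (maxFieldLength < 0) {
            throw new IllegalArgumentException("maxFieldLength must be >= 0");
        }
        int keep = Math.min(tokens.size(), maxFieldLength);
        return new ArrayList<>(tokens.subList(0, keep));
    }
}
```

Tokens past the limit disappear without any signal to the caller, which is the silent-truncation behaviour the comment argues should be an explicit application choice rather than the default.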
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841173#action_12841173 ]

Shai Erera commented on LUCENE-2294:
------------------------------------

Yeah, I don't like it either (it makes my code unnecessarily long). I always use UNLIMITED, and LIMITED=10,000 is really just a guess, so if anyone wants to limit it, he needs to do new MaxFieldLength(otherLimit), which is unnecessarily long as well ...

I like it - I'll deprecate on IW and introduce UNLIMITED on IWC.
Fwd: (SOLR-355) Parsing mixed inclusive/exclusive range queries
If Solr/Lucene dev were merged, and queryParser were its own module, this user could simply upgrade his queryParser JAR to get this fix.

Mike

-- Forwarded message --
From: Alexander S (JIRA) j...@apache.org
Date: Thu, Mar 4, 2010 at 2:24 AM
Subject: (SOLR-355) Parsing mixed inclusive/exclusive range queries
To: solr-...@lucene.apache.org

[ https://issues.apache.org/jira/browse/SOLR-355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841105#action_12841105 ]

Alexander S commented on SOLR-355:
----------------------------------

It is fixed in Lucene, can we get it to SOLR?

Parsing mixed inclusive/exclusive range queries
-----------------------------------------------

Key: SOLR-355
URL: https://issues.apache.org/jira/browse/SOLR-355
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.2
Reporter: Andrew Schurman
Priority: Minor
Attachments: solr-355.patch

The current query parser doesn't handle parsing a range query (i.e. ConstantScoreRangeQuery) with mixed inclusive/exclusive bounds.
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841189#action_12841189 ]

Shai Erera commented on LUCENE-2294:
------------------------------------

I was wondering if perhaps, instead of allowing to pass create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND. The current meaning of create is a bit unclear. I.e. if it is true, then overwrite. But if it is false, don't attempt to create, just open an existing index; and if the directory is empty, it throws an exception. I think an enum would allow someone to pass CREATE_OR_APPEND in case he doesn't know if there is an index there ... but I don't want to complicate things unnecessarily ... what do you think?
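The three-valued open mode proposed in the comment above can be sketched as follows. The type names OpenModeSketch and OpenModeResolver and the helper shouldCreate are hypothetical, and the boolean indexExists stands in for whatever check a real writer would run against the Directory.

```java
// Hypothetical sketch of replacing create=true/false with a 3-valued enum.
enum OpenModeSketch { CREATE, APPEND, CREATE_OR_APPEND }

final class OpenModeResolver {
    private OpenModeResolver() {}

    /**
     * Returns true if a new index should be created (overwriting any
     * existing one), false if the existing index should be opened.
     */
    static boolean shouldCreate(OpenModeSketch mode, boolean indexExists) {
        switch (mode) {
            case CREATE:
                return true; // same as create=true today: always overwrite
            case APPEND:
                if (!indexExists) {
                    // same as create=false on an empty directory today
                    throw new IllegalStateException("no index to append to");
                }
                return false;
            case CREATE_OR_APPEND:
                return !indexExists; // create only when nothing is there
            default:
                throw new AssertionError(mode);
        }
    }
}
```

CREATE and APPEND reproduce the two existing meanings of the boolean, while CREATE_OR_APPEND covers the caller who does not know whether an index is already present.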
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841193#action_12841193 ]

Earwin Burrfoot commented on LUCENE-2293:
-----------------------------------------

bq. I wonder if that won't complicate IW code in return? Perhaps we'll gain a lot of simplification on DW, so a bit of complexity on IW will be ok.

That will get rid of all that *PerThread insanity for each DW component, if I'm getting it right. That's -13 classes. Yay for the issue!

On a random sidenote, can we group things like these into subpackages? Having 132 files in oal.index is somewhat intimidating when trying to read/understand things.
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841199#action_12841199 ]

Shai Erera commented on LUCENE-2294:
------------------------------------

IndexingChain is one of the things that can be set on IW. However, I don't see any implementations of it besides the default, and the class itself is package-private, so no app could actually set it on IW (unless it puts its code under o.a.l.index). Therefore I'm thinking of not introducing it on IWC. Or should we turn it into a public class? Is it really something we expect any application out there to set, or can we simply have DocsWriter implement one for itself internally, and not declare this class as abstract etc.?
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841207#action_12841207 ]

Shai Erera commented on LUCENE-2294:
------------------------------------

I'm thinking of making this whole IWC a constructor-only parameter to IW, without the ability to set it afterwards. I don't see any reason why anyone would change the RAM limit, Similarity etc. while IW is running. What's the advantage vs., say, closing the current IW and opening a new one with the different settings? I know the latter is more expensive, and I write it deliberately - I think those settings are really ctor-only settings. Otherwise you might get inconsistent documents (e.g. by changing the Similarity or max field length).

This will also simplify IWC, because right now I need to distinguish the settings that cannot be altered afterwards, like IndexDeletionPolicy, create, IndexCommit, Analyzer ... If IWC is a ctor-only object, I can have only the default ctor (to init to default settings) and provide setters for everything else.

Any objections?
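One way to picture the constructor-only idea is a writer that copies each setting into a final field when it is built, so later changes to the config object cannot affect a running writer. This is only a sketch under that assumption; CtorOnlyConfig and CtorOnlyWriter are hypothetical names, and the single RAM buffer setting with its 16 MB default is chosen purely for illustration.

```java
// Hypothetical sketch: settings are read once, in the writer's ctor.
final class CtorOnlyConfig {
    private double ramBufferSizeMB = 16.0; // assumed default for the sketch

    CtorOnlyConfig setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    double getRamBufferSizeMB() { return ramBufferSizeMB; }
}

final class CtorOnlyWriter {
    // Captured at construction and never reassigned.
    private final double ramBufferSizeMB;

    CtorOnlyWriter(CtorOnlyConfig config) {
        this.ramBufferSizeMB = config.getRamBufferSizeMB();
    }

    double getRamBufferSizeMB() { return ramBufferSizeMB; }
}
```

To change a setting under this design, the application would close the writer and open a new one with a different config, which is the trade-off the comment accepts.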
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841225#action_12841225 ]

Michael McCandless commented on LUCENE-2293:
--------------------------------------------

I agree IW should not spawn its own threads. It should piggyback on incoming threads.

On whether we can remove the perThread layer throughout the chain -- that would be compelling. But, we should scrutinize what that layer does throughout the current chain to assess what we might lose.

But, I was proposing a bigger change (call it private RAM segments): there would be multiple DWs, each one writing to its own private RAM segment (each one getting private docID assignment) *and* its own doc stores. There would be no more WaitQueue in IW. Each DW would flush its own segment privately. They would not all flush at once (merging their postings) like we must do today because they share a single docID space.

As I understand it, this would be a step towards how Lucy handles concurrency during indexing. Ie, it'd make the DWs nearly fully independent from one another, and then IW is just there to dispatch/do merging/etc. (In Lucy each writer is a separate process, I think -- VERY independent).

We could do both changes, too (remove the perThread layer of the indexing chain and switch to private RAM segments) -- I think they are actually orthogonal.

bq. The other downside is that you would have to buffer deleted docs and queries separately for each thread state, because you have to keep the private docID? So that would need a bit more memory.

Right.

bq. Mike, good one! Would having a doc id stream per thread make implementing a searchable RAM buffer easier?

Yes -- they would just appear like sub segments.

bq. I hope we won't lose monotonic docIDs for a singlethreaded indexation somewhere along that path.

We won't.

{quote}
Instead, I prefer to take advantage of the application's concurrency level in the following way:
* Each thread will continue to write documents to a ThreadState. We'll allow changing the MAX_LEVEL, so if an app wants to get more concurrency, it can.
  - MAX_LEVEL will set the number of ThreadState objects available.
* All threads will obtain memory buffers from a pool which will be limited by IW's RAM limit.
* When a thread finishes indexing a document and realizes the pool has been exhausted, it flushes its ThreadState.
  - At that moment, that ThreadState is pulled out of the 'active' list and is flushed. When it's done, it reclaims its used buffers and is put back in the active list.
  - New threads that come in will simply pick a ThreadState from the pool (but we'll bind them to that instance until it's flushed) and add documents to them.
  - That way, we hijack an application thread to do the flushing, which is anyway what happens today.
{quote}

+1 -- this I think matches what I was thinking.

bq. If only WaitQueue was documented

Sorry :( But WaitQueue would go away with this change. We would no longer have shared doc stores!
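The private RAM segments idea discussed in LUCENE-2293 (each thread state has its own buffer and its own docID stream, and only the state that notices the shared RAM budget is exhausted gets flushed, as its own segment) can be sketched as below. This is a toy model, not DocumentsWriter: the class names are hypothetical, a document is just a String, its length stands in for RAM used, and one coarse lock replaces the per-state locking a real implementation would want.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-thread-state buffer with a private docID stream.
final class PrivateSegmentBuffer {
    private final List<String> docs = new ArrayList<>();
    private long bytesUsed;

    /** Buffers a doc and returns its private docID (restarts at 0 after a flush). */
    int add(String doc) {
        docs.add(doc);
        bytesUsed += doc.length(); // crude stand-in for RAM accounting
        return docs.size() - 1;
    }

    long bytesUsed() { return bytesUsed; }

    /** Hands the buffered docs over as one segment and resets the buffer. */
    List<String> flush() {
        List<String> segment = new ArrayList<>(docs);
        docs.clear();
        bytesUsed = 0;
        return segment;
    }
}

// Hypothetical dispatcher: a fixed number of states sharing one RAM budget.
final class PrivateSegmentWriter {
    private final PrivateSegmentBuffer[] states;
    private final long ramLimitBytes;
    private final List<List<String>> flushedSegments = new ArrayList<>();

    PrivateSegmentWriter(int maxThreadStates, long ramLimitBytes) {
        this.states = new PrivateSegmentBuffer[maxThreadStates];
        for (int i = 0; i < maxThreadStates; i++) {
            states[i] = new PrivateSegmentBuffer();
        }
        this.ramLimitBytes = ramLimitBytes;
    }

    /**
     * Adds a doc to one state. If the shared budget is now exhausted, only
     * that state is flushed; the other states keep their buffered docs.
     */
    synchronized int addDocument(int stateIndex, String doc) {
        PrivateSegmentBuffer state = states[stateIndex];
        int docId = state.add(doc);
        if (totalBytesUsed() >= ramLimitBytes) {
            flushedSegments.add(state.flush());
        }
        return docId;
    }

    private long totalBytesUsed() {
        long total = 0;
        for (PrivateSegmentBuffer s : states) {
            total += s.bytesUsed();
        }
        return total;
    }

    synchronized long bytesUsed(int stateIndex) {
        return states[stateIndex].bytesUsed();
    }

    synchronized List<List<String>> flushedSegments() {
        return new ArrayList<>(flushedSegments);
    }
}
```

Because each flushed segment contains only one state's documents, no merge sort across states is needed at flush time; combining segments is left to whatever merges them later.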
Vote on merging dev of Lucene and Solr
For those committers who don't follow the general mailing list, or don't follow it that closely, we are currently having a vote for committers:

http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development

--
- Mark

http://www.lucidimagination.com
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841232#action_12841232 ]

Michael McCandless commented on LUCENE-2294:
--------------------------------------------

+1 -- this is great!

bq. I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength).

This is because it's a dangerous setting (you silently lose content while indexing), a trap. So we want to force the user to make the choice, up front, so they realize the implications. But, if we change the default to UNLIMITED (which we should do under Version), then I agree you should not have to specify it.

bq. In my opinion it should be left to the application to limit the number of tokens if needed, but not silently drop tokens

I like that approach -- we could make a TokenFilter to do this? Then we don't need MFL at all in IWC (and deprecate in IW).

bq. I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND

+1

bq. I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards.

+1 in general, though we should go setting by setting to confirm this is OK. I don't know of real use cases where apps eg want to change RAM buffer or mergeFactor... but maybe there are some interesting usages out there.
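The suggestion in this thread to let the application cap the token count with a filter, rather than have the writer drop tokens silently, can be illustrated with a plain-Java analogue. This deliberately does not use Lucene's TokenStream/TokenFilter classes: tokens are modelled as an Iterator of Strings, and the name LimitTokenCountIterator is hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a token-limiting filter: passes through at most
// maxTokenCount tokens from the wrapped stream, then reports end of stream.
final class LimitTokenCountIterator implements Iterator<String> {
    private final Iterator<String> input;
    private final int maxTokenCount;
    private int emitted;

    LimitTokenCountIterator(Iterator<String> input, int maxTokenCount) {
        this.input = input;
        this.maxTokenCount = maxTokenCount;
    }

    @Override
    public boolean hasNext() {
        return emitted < maxTokenCount && input.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        emitted++;
        return input.next();
    }
}
```

The limit becomes an explicit choice made where the analysis chain is assembled, so an application that never wraps its stream never loses tokens.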
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841235#action_12841235 ]

Shai Erera commented on LUCENE-2293:
------------------------------------

Perhaps instead of buffering the delete Terms/Queries somewhere central, when a delete by term is performed by a certain DW, it can register it immediately on all existing DWs. Each DW will record the doc ID up until which this term delete should be executed, and when it's time for it to flush, it will apply all the deletes that were accumulated on itself. It'll be like doing Parallel segment deletes (but maybe I'm too into Parallel Indexing :)).

This should not affect any documents that were added to any DW after the delete happened, and if we simply do it (synced) across all active DWs, I think we should be fine?
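The delete scheme suggested in this comment (register a delete-by-term on every active DW, with each DW remembering the private docID up to which the delete applies) can be sketched roughly as follows. The class names are hypothetical, each document carries a single term for brevity, and flush simply returns the surviving docIDs instead of writing a segment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-DW buffer that tracks "delete term up to docID N".
final class DeleteTrackingWriter {
    private final List<String> docTerms = new ArrayList<>(); // one term per doc
    private final Map<String, Integer> deleteUpto = new HashMap<>();

    /** Buffers a doc and returns its private docID. */
    int addDocument(String term) {
        docTerms.add(term);
        return docTerms.size() - 1;
    }

    /** Records that docs with this term and docID below the current count are deleted. */
    void bufferDelete(String term) {
        deleteUpto.put(term, docTerms.size());
    }

    /** Applies the buffered deletes and returns the surviving private docIDs. */
    List<Integer> flush() {
        List<Integer> live = new ArrayList<>();
        for (int docId = 0; docId < docTerms.size(); docId++) {
            Integer upto = deleteUpto.get(docTerms.get(docId));
            if (upto == null || docId >= upto) {
                live.add(docId); // never deleted, or added after the delete
            }
        }
        docTerms.clear();
        deleteUpto.clear();
        return live;
    }
}

// Hypothetical coordinator: registers each delete on all active writers.
final class DeleteBroadcaster {
    private final List<DeleteTrackingWriter> writers;

    DeleteBroadcaster(List<DeleteTrackingWriter> writers) {
        this.writers = writers;
    }

    /** One lock around the loop models the "synced across all active DWs" step. */
    synchronized void deleteByTerm(String term) {
        for (DeleteTrackingWriter w : writers) {
            w.bufferDelete(term);
        }
    }
}
```

A document added to any writer after deleteByTerm returns gets a docID at or above the recorded limit, so it survives that writer's flush, which is the property the comment asks for. The sketch does not synchronize addDocument against deleteByTerm; a real implementation would need to.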
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841239#action_12841239 ]

Shai Erera commented on LUCENE-2294:

bq. But, if we change the default to UNLIMITED

Today there is no DEFAULT: IW forces you to pass MFL, so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED, but there won't be any back-compat issue ...

bq. we could make a TokenFilter to do this?

I'm afraid that will result in changing all Analyzers to work properly? Or do you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?). Also, if I used such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ...

I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such a filter?

Create IndexWriterConfiguration and store all of IW configuration there
---
Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Fix For: 3.1

I would like to factor all IW configuration parameters out into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters:
* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example, I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything through setter/getter methods, and default to whatever IW defaults to today. Except for Analyzer, which will need to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10,000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ...

I plan to deprecate all the ctors and getters/setters and replace them by:
* One ctor as described above
* getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start to work on a patch.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
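
Editor's sketch: the chained-setter configuration proposed in the issue description above could look roughly like the following. This is a minimal illustration under stated assumptions, not the patch that was eventually committed: SketchWriterConfig, SketchWriter, the two settings shown, and their default values are all hypothetical names chosen for this example, and plain strings stand in for Analyzer and Directory so the block compiles without Lucene on the classpath.

```java
// Hypothetical sketch only: these are not Lucene's classes or method names.
final class SketchWriterConfig {
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final String analyzerName;       // stand-in for an Analyzer; ctor-only, no setter
    private int maxFieldLength = UNLIMITED;  // the default the thread leans toward
    private int maxBufferedDocs = 1000;      // arbitrary illustrative default

    SketchWriterConfig(String analyzerName) {
        if (analyzerName == null) {
            throw new IllegalArgumentException("analyzer is required");
        }
        this.analyzerName = analyzerName;
    }

    // Each setter returns this so that calls can be chained.
    SketchWriterConfig setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    SketchWriterConfig setMaxBufferedDocs(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
        return this;
    }

    String getAnalyzerName() { return analyzerName; }
    int getMaxFieldLength() { return maxFieldLength; }
    int getMaxBufferedDocs() { return maxBufferedDocs; }
}

final class SketchWriter {
    private final String directory;  // stand-in for a Directory
    private SketchWriterConfig config;

    // The single remaining constructor: (directory, config).
    SketchWriter(String directory, SketchWriterConfig config) {
        this.directory = directory;
        this.config = config;
    }

    SketchWriterConfig getConfig() { return config; }
    void setConfig(SketchWriterConfig config) { this.config = config; }
    String getDirectory() { return directory; }
}

final class ConfigSketchDemo {
    public static void main(String[] args) {
        SketchWriter writer = new SketchWriter("/tmp/index",
            new SketchWriterConfig("whitespace").setMaxFieldLength(10000));
        // Chained update, in the spirit of iw.setConfig(iw.getConfig().setSomething1()...)
        writer.setConfig(writer.getConfig().setMaxBufferedDocs(500));
        System.out.println(writer.getConfig().getMaxFieldLength()
            + " " + writer.getConfig().getMaxBufferedDocs());
    }
}
```

Making the one required value a constructor argument and everything else an optional setter is one way to answer the "what is required?" concern raised later in the thread: whatever the config constructor demands is required, and nothing else is.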
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249 ]

Mark Miller commented on LUCENE-2294:

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of Windows C programming and structs. When I'm just coding away, it's so much easier to just enter the params in the ctor. And it seems like it would be more difficult to know what's *required* to set on the config class - without the same ctor business ...
[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249 ]

Mark Miller edited comment on LUCENE-2294 at 3/4/10 1:45 PM:

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of Windows C programming and structs. When I'm just coding away, it's so much easier to just enter the params in the ctor. And it seems like it would be more difficult to know what's *required* to set on the config class - without the same ctor business ...

*edit* Though I suppose the chaining *does* make this more swallowable... new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841262#action_12841262 ]

Shai Erera commented on LUCENE-2294:

I wouldn't worry about what's required - Directory will be left out, MFL is useless and a pain anyway, so what's left is Analyzer. I can put Analyzer on IWC's ctor, but I personally think we can default to a simple one (such as Whitespace), encouraging people to set their own. I find it very annoying today when I want to test something about IW and I need to pass all these things to IW ...

The way I see it, those who want to rely on Lucene's latest and greatest can just do: IndexWriter writer = new IndexWriter(dir, new IWC()); Well, maybe except for the Analyzer, but I really don't think it matters that much. And like you wrote, someone can chain the setters. So win-win? If you don't care about anything and just want to open a writer, index something and that's it, you don't need to specify anything .. otherwise you just chain calls?

One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but it serves as a placeholder for changing settings in the future.
Re: Vote on merging dev of Lucene and Solr
+1

On Thu, Mar 4, 2010 at 6:32 PM, Mark Miller markrmil...@gmail.com wrote:
For those committers that don't follow the general mailing list, or follow it that closely, we are currently having a vote for committers: http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development
--
- Mark
http://www.lucidimagination.com

--
Noble Paul | Systems Architect | AOL | http://aol.com
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841339#action_12841339 ]

Michael McCandless commented on LUCENE-2294:

bq. Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ..

Ahh sorry, right. In the olden days, 10,000 was the default.

{quote}
bq. we could make a TokenFilter to do this?

I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?).
{quote}

Hmm, yeah, quite a hassle to fix all analyzers. Hmmm.

bq. I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter?

Maybe one solution is to wrap any other analyzer? Ie, create a StopAfterNTokensAnalyzer that takes another analyzer to delegate to and then sticks this StopAfterNTokensFilter onto each token stream. But yeah, maybe break this out as a separate issue...

bq. Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ...

Actually it ought to be 0 terms wasted, with the filter @ the end -- with this StopAfterNTokensFilter, it'll immediately return false w/o asking for the 10001th token.

bq. One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future.

+1
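
Editor's sketch: the "0 terms wasted" point in the comment above can be shown with a small limiter that checks its own count before it asks upstream for anything. This is a sketch under assumptions, not Lucene's TokenFilter API: a plain java.util.Iterator of strings stands in for a token stream, and StopAfterNTokensSketch is a hypothetical name borrowed from the comment.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the proposed StopAfterNTokensFilter; an
// Iterator<String> plays the role of the wrapped token stream.
final class StopAfterNTokensSketch implements Iterator<String> {
    private final Iterator<String> input;
    private final int maxTokens;
    private int emitted = 0;

    StopAfterNTokensSketch(Iterator<String> input, int maxTokens) {
        this.input = input;
        this.maxTokens = maxTokens;
    }

    @Override
    public boolean hasNext() {
        // The count is checked first, so once the limit is reached the
        // upstream is never consulted again and token N+1 is never produced.
        return emitted < maxTokens && input.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        emitted++;
        return input.next();
    }
}

final class StopAfterNTokensDemo {
    public static void main(String[] args) {
        Iterator<String> tokens = Arrays.asList("a", "b", "c", "d").iterator();
        Iterator<String> limited = new StopAfterNTokensSketch(tokens, 2);
        while (limited.hasNext()) {
            System.out.println(limited.next());
        }
    }
}
```

Because the limiter short-circuits on its own counter, placing it last in the chain costs no extra upstream work once the limit is hit, which is the argument made in the comment.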
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841342#action_12841342 ]

Jason Rutherglen commented on LUCENE-2293:

bq. But WaitQueue would go away with this change. We would no longer have shared doc stores!

Cool. Most of the DW code is intuitive except the shared doc stores, because it's hard to see when a doc store ends. Also the interleaving is a bit difficult to visualize. I look forward to checking out DW after this change.
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841347#action_12841347 ]

Michael McCandless commented on LUCENE-2293:

Yes, I think each DW will have to record its own buffered delete Term/Query, mapping to its docID at the time the delete arrived.

Syncing across all of them would work but may be overkill. I think we could instead have a lock-free collection (need not even be FIFO -- the order doesn't matter) into which we add all Term/Query that are deleted. Then, any time a thread hits that DW to add a document, it must first service that queue, by popping out all Term/Query stored in it and enrolling them in the un-synchronized map of Term/Query -> docID.
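
Editor's sketch: the "service the queue before adding a document" idea in the comment above might be pictured as follows. This is a minimal sketch under assumptions rather than the design that was implemented: PrivateWriterSketch and its methods are hypothetical names, strings stand in for Term/Query and for documents, and java.util.concurrent.ConcurrentLinkedQueue is used as the lock-free collection (it happens to be FIFO, which the comment notes is not required). The sketch assumes that at most one thread at a time calls addDocument on a given writer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for one private DocumentsWriter.
final class PrivateWriterSketch {
    // Any thread may enqueue a delete here without taking a lock.
    private final ConcurrentLinkedQueue<String> pendingDeletes =
        new ConcurrentLinkedQueue<String>();

    // Un-synchronized map of delete term -> docID limit. It is only touched
    // by the single thread that currently owns this writer.
    private final Map<String, Integer> deleteUpTo = new HashMap<String, Integer>();

    private int nextDocID = 0;

    // Called by any thread that issues a delete.
    void bufferDelete(String term) {
        pendingDeletes.add(term);
    }

    // Called by the thread that owns this writer. Pending deletes are
    // drained first, so each one is recorded against the docIDs that
    // existed before this document was added.
    int addDocument(String doc) {
        drainDeletes();
        return nextDocID++;
    }

    private void drainDeletes() {
        String term;
        while ((term = pendingDeletes.poll()) != null) {
            // The delete applies to documents with docID < nextDocID.
            deleteUpTo.put(term, nextDocID);
        }
    }

    // Returns the recorded docID limit, or null if the delete has not been
    // serviced yet (or was never issued).
    Integer deleteLimitFor(String term) {
        return deleteUpTo.get(term);
    }
}
```

A real implementation would presumably also have to drain the queue before flushing, since a delete could arrive after the last document was added; that step is left out here.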
Request for clarification on unordered SpanNearQuery
I've been working on some highlighting changes involving Spans (https://issues.apache.org/jira/browse/LUCENE-2287) and could use some help understanding when overlapping Spans are valid. To illustrate, I added the test below to the TestSpans class; this test fails because there is no fourth range. Am I wrong in my expectation that that last range would match?

Thanks.

Mike

    // Doc 11 contains "t1 t2 t1 t3 t2 t3"
    public void testSpanNearUnOrderedOverlap() throws Exception {
      boolean ordered = false;
      int slop = 1;
      SpanNearQuery snq = new SpanNearQuery(
          new SpanQuery[] {
            makeSpanTermQuery("t1"),
            makeSpanTermQuery("t2"),
            makeSpanTermQuery("t3") },
          slop,
          ordered);
      Spans spans = snq.getSpans(searcher.getIndexReader());

      assertTrue("first range", spans.next());
      assertEquals("first doc", 11, spans.doc());
      assertEquals("first start", 0, spans.start());
      assertEquals("first end", 4, spans.end());

      assertTrue("second range", spans.next());
      assertEquals("second doc", 11, spans.doc());
      assertEquals("second start", 1, spans.start());
      assertEquals("second end", 4, spans.end());

      assertTrue("third range", spans.next());
      assertEquals("third doc", 11, spans.doc());
      assertEquals("third start", 2, spans.start());
      assertEquals("third end", 5, spans.end());

      // Question: why wouldn't this Span be found?
      assertTrue("fourth range", spans.next());
      assertEquals("fourth doc", 11, spans.doc());
      assertEquals("fourth start", 2, spans.start());
      assertEquals("fourth end", 6, spans.end());

      assertFalse("fifth range", spans.next());
    }
Re: Request for clarification on unordered SpanNearQuery
On 03/04/2010 11:34 AM, Goddard, Michael J. wrote:

    // Question: why wouldn't this Span be found?
    assertTrue("fourth range", spans.next());
    assertEquals("fourth doc", 11, spans.doc());
    assertEquals("fourth start", 2, spans.start());
    assertEquals("fourth end", 6, spans.end());

Spans are funny beasts ;) No Spans ever start from the same position more than once. In effect, they are always marching forward. The third range starts at 2, and once it finds a match starting at 2, it moves on. So it won't find the other match that starts at 2. Spans are not exhaustive - exhaustive matching would be a different algorithm. So yes, you are wrong in your expectation :) That's just how Spans were implemented.

--
- Mark
http://www.lucidimagination.com
[jira] Resolved: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter
[ https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2283.
Resolution: Fixed

Thanks Tim!

Possible Memory Leak in StoredFieldsWriter
--
Key: LUCENE-2283
URL: https://issues.apache.org/jira/browse/LUCENE-2283
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2283.patch, LUCENE-2283.patch, LUCENE-2283.patch

StoredFieldsWriter creates a pool of PerDoc instances. This pool will grow but never be reclaimed by any mechanism. Furthermore, each PerDoc instance contains a RAMFile, and this RAMFile will also never be truncated (it will only ever grow), as far as I can tell.

When feeding documents with a large number of stored fields (or one large dominating stored field), this can result in memory being consumed in the RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very large, even if large documents are rare.

Seems like there should be some attempt to reclaim memory from the PerDoc[] instance pool (or otherwise limit the size of the RAMFiles that are cached), etc.
Re: Request for clarification on unordered SpanNearQuery
Michael,

The test for the 4th range fails because the first matching subspans (for t1 in this case) is always the one that is advanced first, and the first match at that point has less slop (0) than the maximum allowed (1), so one might actually try to advance another subspans first. But that is not really straightforward to implement, especially when different terms can be indexed in the same position.

Perhaps the javadocs for the unordered case should be improved to mention that in the unordered case the first subspans is always the one that is advanced first.

Regards,
Paul Elschot
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841388#action_12841388 ] Michael Busch commented on LUCENE-2293: --- {quote} But, I was proposing a bigger change (call it private RAM segments): there would be multiple DWs, each one writing to its own private RAM segment (each one getting private docID assignment) and its own doc stores. {quote} Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I like it. IndexWriter has hard limit on max concurrency - Key: LUCENE-2293 URL: https://issues.apache.org/jira/browse/LUCENE-2293 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 DocumentsWriter has this nasty hardwired constant: {code} private final static int MAX_THREAD_STATE = 5; {code} which probably I should have attached a //nocommit to the moment I wrote it ;) That constant sets the max number of thread states to 5. This means, if more than 5 threads enter IndexWriter at once, they will share only 5 thread states, meaning we gate CPU concurrency to 5 running threads inside IW (each thread must first wait for the last thread to finish using the thread state before grabbing it). This is bad because modern hardware can make use of more than 5 threads. So I think an immediate fix is to make this settable (expert), and increase the default (8?). It's tricky, though, because the more thread states, the less RAM efficiency you have, meaning the worse indexing throughput. So you shouldn't up and set this to 50: you'll be flushing too often. But... I think a better fix is to re-think how threads write state into DocumentsWriter. Today, a single docID stream is assigned across threads (eg one thread gets docID=0, next one docID=1, etc.), and each thread writes to a private RAM buffer (living in the thread state), and then on flush we do a merge sort. 
The merge sort is inefficient (does not currently use a PQ)... and, wasteful because we must re-decode every posting byte. I think we could change this, so that threads write to private RAM buffers, with a private docID stream, but then instead of merging on flush, we directly flush each thread as its own segment (and, allocate private docIDs to each thread). We can then leave merging to CMS which can already run merges in the BG without blocking ongoing indexing (unlike the merge we do in flush, today). This would also allow us to separately flush thread states. Ie, we need not flush all thread states at once -- we can flush one when it gets too big, and then let the others keep running. This should be a good concurrency gain since it uses IO and CPU resources throughout indexing instead of the big burst of CPU only, then big burst of IO only, that we have today (flush today stops the world). One downside I can think of is... docIDs would now be less monotonic. Today, if N threads are indexing, you'll roughly get in-time-order assignment of docIDs. But with this change, all of one thread state would get 0..N docIDs, the next thread state'd get N+1...M docIDs, etc. However, a single thread would still get monotonic assignment of docIDs.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
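The "flush each thread state as its own segment" idea in the issue description can be sketched roughly as follows. This is only an illustration, not DocumentsWriter's code: `PrivateBuffer`, `maxBufferedDocs` and the list of flushed segments are hypothetical names, and RAM usage is approximated by a simple doc count.

```java
import java.util.ArrayList;
import java.util.List;

final class PrivateSegmentsDemo {
    /** Hypothetical stand-in for one thread state with its own docID stream. */
    static final class PrivateBuffer {
        private final int maxBufferedDocs;               // assumed flush trigger; real code would track RAM
        private final List<String> docs = new ArrayList<>();
        private final List<List<String>> flushedSegments; // shared list of "segments"

        PrivateBuffer(int maxBufferedDocs, List<List<String>> flushedSegments) {
            this.maxBufferedDocs = maxBufferedDocs;
            this.flushedSegments = flushedSegments;
        }

        /** Adds a doc and returns its docID, which is private to this buffer's current segment. */
        int addDocument(String doc) {
            int docID = docs.size();
            docs.add(doc);
            if (docs.size() >= maxBufferedDocs) {
                flush(); // only this buffer flushes; the others keep indexing
            }
            return docID;
        }

        /** Writes the buffered docs out as one segment; no merge sort across buffers. */
        void flush() {
            if (docs.isEmpty()) {
                return;
            }
            synchronized (flushedSegments) {
                flushedSegments.add(new ArrayList<>(docs));
            }
            docs.clear();
        }
    }

    static List<List<String>> run() {
        List<List<String>> segments = new ArrayList<>();
        PrivateBuffer a = new PrivateBuffer(2, segments);
        PrivateBuffer b = new PrivateBuffer(3, segments);
        a.addDocument("a0");
        b.addDocument("b0");
        a.addDocument("a1"); // a hits its limit and flushes on its own
        b.addDocument("b1");
        a.flush();           // nothing buffered, no-op
        b.flush();           // b flushes its two docs as a second segment
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Each buffer would be owned by one indexing thread at a time; merging the resulting small segments is left to the background merge scheduler, as the description suggests.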
Re: Baby steps towards making Lucene's scoring more flexible...
On Tue, Mar 2, 2010 at 4:12 PM, Marvin Humphrey mar...@rectangular.com wrote: On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote: The problem is, these scoring models need the avg field length (in tokens) across the entire index, to compute the norms. Ie, you can't do that on writing a single segment. I don't see why not. We can just move everything you're doing on Searcher open to index time, and calculate the stats and norms before writing the segment out. At search time, the only segment with valid norms would be the last one, so we'd make sure the Searcher used those. I see -- write norms for all segments (the full index) on each commit? OK. And in fact if we left it at searcher init time, you'd still [technically] have to recompute the norms arrays across all segments whenever one even tiny segment was added, since [technically] the average has changed. But I agree, once the index is large enough, presumably the average won't change much, so... Even in the NRT case we'd have to compute norms across the entire index with only a small segment added. I think the fact that Lucy always writes one segment per indexing session -- as opposed to Lucene's one segment per document -- makes a difference here. Lucene isn't one segment per doc anymore -- it's one segment per-when-RAM-buffer-filled-up. Not sure it really makes a difference though, since we [technically] need norms regen'd for the entire index. Whether burning norms to disk at index time is the most efficient setup depends on the ratio of commits to searcher-opens. Yes, and NRT opens. In a multi-node search cluster, pre-calculating norms at index-time wouldn't work well without additional communication between nodes to gather corpus-wide stats. But I suspect the same trick that works for IDF in large corpuses would work for average field length: it will tend to be the stable over time, so you can update it infrequently. Right I imagine we'd need to use this trick within a single index, too. 
Recomputing norms for entire index when only a small new segment was added to the new NRT reader will probably be too costly. Though one alternative (if you don't mind burning RAM) is to skip casting to norms, ie store the actual field length, and do the divide-by-avg during scoring (though that's a biggish hit to search perf). So I think it must be done during searcher init. The most we can do is store the aggregates (eg sum of all lengths in this segment) in the SegmentInfo -- this saves one pass on searcher init. Logically...

    token_counts: {
        segment: {
            title: 4,
            content: 154,
        },
        all: {
            title: 98342,
            content: 2854213
        }
    }

(Would that suffice? I don't recall the gory details of BM25.) I think so, though why store all, per segment? Reader can regen on open? (That above json comes from a single segment right?). lnu.ltc would need sum(avg(tf)) as well. As documents get deleted, the stats will gradually drift out of sync, just like doc freq does. However, that's mitigated if you recycle segments that exceed a threshold deletion percentage on a regular basis. Right. The norms array will be stored in this per-field sim instance. Interesting, but that wasn't where I was thinking of putting them. Similarity objects need to be sent over the network, don't they? At least they do in KS. So I think we need a local per-field PostingsReader object to hold such cached data. OK maybe not stored on them, but, accessible to them. Maybe cached in the SegmentReader. Though we need every norm(docID) lookup to be fast. Maybe we ask the per-field Similarity to give us a scorer, that holds the right byte[]? The insane loose typing of fields in Lucene is going to make it a little tricky to implement, though. I think you just have to exclude fields assigned to specific similarity implementations from your merge-anything-to-the-lowest-common-denominator policy and throw exceptions when there are conflicts rather than attempt to resolve them.
Our disposition on conflict (throw exception vs silently coerce) should just match what we do today, which is to always silently coerce.

What do you do when you have to reconcile two posting codecs like this?

* doc id, freq, position, part-of-speech identifier
* doc id, boost

Do you silently drop all information except doc id?

I don't know -- we haven't hit that yet ;) The closest we have is when doc id is merged with doc id,freq,position+, and in that case we drop the freq,position+. With flex this'll be up to the codec's merge methods.

Similarity is where we decode norms right now. In my opinion, it should be the Similarity object from which we specify per-field posting formats.

I agree.

Great, I'm glad we're on the same page about that.

Actually [sorry] I'm no longer so sure I agree! In flex we have a separate Codec class that's responsible
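For the average-field-length discussion in this thread, here is a rough sketch of how per-segment aggregates could be combined when a reader opens, with the divide-by-average done at scoring time. It is not Lucene or Lucy code: `SegmentStats` and both method names are hypothetical, and the formula is the standard BM25 term-frequency normalization with parameters k1 and b.

```java
import java.util.Arrays;
import java.util.List;

final class AvgFieldLengthDemo {
    /** Hypothetical per-segment aggregate for one field: total tokens and doc count. */
    static final class SegmentStats {
        final long sumFieldLength;
        final int docCount;

        SegmentStats(long sumFieldLength, int docCount) {
            this.sumFieldLength = sumFieldLength;
            this.docCount = docCount;
        }
    }

    /** Combines the stored per-segment sums into an index-wide average on reader open. */
    static double averageFieldLength(List<SegmentStats> segments) {
        long tokens = 0;
        long docs = 0;
        for (SegmentStats s : segments) {
            tokens += s.sumFieldLength;
            docs += s.docCount;
        }
        return docs == 0 ? 0.0 : (double) tokens / docs;
    }

    /** BM25 tf component; the length normalization uses the raw field length and the average. */
    static double bm25Tf(double tf, double fieldLength, double avgLength, double k1, double b) {
        return tf * (k1 + 1) / (tf + k1 * (1 - b + b * fieldLength / avgLength));
    }

    public static void main(String[] args) {
        List<SegmentStats> segments = Arrays.asList(
            new SegmentStats(100, 10),
            new SegmentStats(50, 5));
        double avg = averageFieldLength(segments);
        System.out.println("avg field length = " + avg);
        System.out.println("bm25 tf for an average-length doc = " + bm25Tf(1, avg, avg, 1.2, 0.75));
    }
}
```

Adding a small segment only changes the two running sums, so the average can be refreshed cheaply; the cost shifts to the per-hit division, which is the search-time hit mentioned in the thread.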
Composing posts for both JIRA and email (was a JIRA post)
(CC to lucy-dev and general, reply-to set to general)

On Thu, Mar 04, 2010 at 06:18:28AM +, Shai Erera (JIRA) wrote:
> (Warning, this post is long, and is easier to read in JIRA)

I consume email from many of the Lucene lists, and I hate it when people force me to read stuff via JIRA. It slows me down to have to jump to all those forum web pages. I only go to the web page if there are 5 or more posts in a row on the same issue that I need to read.

For what it's worth, I've worked out a few routines that make it possible to compose messages which read well in both mediums.

* Never edit your posts unless absolutely necessary. If JIRA used diffs, things would be different, but instead it sends the whole frikkin' post twice (before and after), which makes it very difficult to see what was edited. If you must edit, append an "edited:" block at the end to describe what you changed instead of just making changes inline.
* Use Firefox and the It's All Text plugin, which makes it possible to edit JIRA posts using an external editor such as Vim instead of typing into a textarea. http://trac.gerf.org/itsalltext
* After editing, use the preview button (it's a little monitor icon to the upper right of the textarea) to make sure the post looks good in JIRA.
* Use > for quoting instead of JIRA's bq. and {quote} since JIRA's mechanisms look so crappy in email. This is easy from Vim, because rewrapping a long line (by typing gq from visual mode to rewrap the current selection) that starts with > causes > to be prepended to the wrapped lines.
* Use asterisk bullet lists liberally, because they look good everywhere.
* Use asterisks for *emphasis*, because that looks good everywhere.
* If you wrap lines, use a reasonably short line length. (I use 78; Mike McCandless, who also wraps lines for his Jira posts, uses a smaller number). Otherwise you'll get nasty wrapping in narrow windows, both in email clients and web browsers.

There are still a couple of compromises that don't work out well. For email, ideally you want to set off code blocks with indenting:

    int foo = 1;
    int bar = 2;

To make code look decent in JIRA, you have to wrap that with {code} tags, which unfortunately look heinous in email. Left-justifying the tags but indenting the code seems like it would be a rotten-but-salvageable compromise, as it at least sets off the tags visually rather than making them appear as though they are part of the code fragment.

{code}
    int foo = 1;
    int bar = 2;
{code}

Unfortunately, that's going to look like this in JIRA, because of a bug that strips all leading whitespace from the first line.

|--------------|
| int foo;     |
|     int bar; |
|--------------|

It seems that this has been fixed by Atlassian in the Confluence wiki (http://jira.atlassian.com/browse/CONF-4548), but the issue remains for the JIRA installation at issues.apache.org. So for now, I manually strip indentation until the whole block is flush left.

{code}
int foo = 1;
int bar = 2;
{code}

(Gag. I vastly prefer wikis that automatically apply fixed-width styling to any indented text.)

One last tip for Lucy developers (and other non-Java devs). JIRA has limited syntax highlighting support -- Java, JavaScript, ActionScript, XML and SQL only -- and defaults to assuming your code is Java. In general, you want to override that and tell JIRA to use none.

{code:none}
int foo = 1;
int bar = 2;
{code}

Marvin Humphrey
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841395#action_12841395 ] Michael McCandless commented on LUCENE-2293: bq. Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I like it. I wasn't sure either ;) Ie, I forgot about that aspect of my proposal until it was raised in the discussion... but I think that'd be necessary. This will be a perf hit, when building up a big new index. But since doc stores now merge by bulk copy (when there are no deletions) hopefully the impact isn't too much. And, hopefully it's more than made up for by the improvement in IO/CPU interleaved concurrency. I'll work out a patch to at least make the hardwired 5 configurable... but does anyone out there wanna work out the private RAM segments?
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841407#action_12841407 ] Michael Busch commented on LUCENE-2293: ---

bq. Yes, I think each DW will have to record its own buffered delete Term/Query, mapping to its docID at the time the delete arrived.

I think in the future deletes in DW could work like this:
- DW of course keeps track of a private sequence id, which gets incremented in the add, delete, and update calls
- a DW has a getReader() call; the reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current seqID at the time it was opened - let's call it RAMReader.seqID; if such a reader gets reopened, its seqID simply gets updated.
- we keep a growing int array with the size of DW's maxDoc, which replaces the usual deletes bitset
- when DW.updateDocument() or .deleteDocument() needs to delete a doc we do that right away, before inverting the new doc. We can do that by running a query using a RAMReader to find all docs that must be deleted. Instead of flipping a bit in a bitset, for each hit we now keep track of when it was deleted:

{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
...
Arrays.fill(deletes, NOT_DELETED);
...
public void deleteDocument(Query q) {
  // reopen RAMReader
  // run query q using RAMReader
  for each hit {
    int hitDocId = ...
    if (deletes[hitDocId] == NOT_DELETED) {
      // remember the sequence id at which this doc was deleted
      deletes[hitDocId] = DW.seqID;
    }
  }
  ...
  DW.seqID++;
}
{code}

Now no matter how often you (re)open RAMReaders, they can share the deletes array. No cloning like with the BitSet approach would be necessary. When the RAMReader iterates posting lists it's as simple as this to treat deleted docs correctly. Instead of doing this in RAMTermDocs.next():

{code}
if (deletedDocsBitSet.get(doc)) {
  skip this doc
}
{code}

we can now do:

{code}
if (deletes[doc] < ramReader.seqID) {
  skip this doc
}
{code}

Here is an example:
1. Add 3 docs with DW.addDocument()
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b

After 1: DW.seqID = 2; deletes[]={NOT_DELETED, NOT_DELETED, NOT_DELETED}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[]={NOT_DELETED, 2, NOT_DELETED}
After 4: ramReader_b.seqID = 3

Note that both ramReader_a and ramReader_b share the same deletes[] array. Now when ramReader_a is used to read posting lists, it will not treat doc 1 as deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false. But ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) = (2 < 3) = true.

What do you think about this approach for the future when we have a searchable DW buffer?
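The shared deletes array from Michael Busch's comment can be tried out with a small self-contained sketch. The names `Writer`, `Reader` and `openReader` are made up for illustration, the array has a fixed capacity instead of growing, and deletes are addressed by docID rather than by running a query. The counter here starts at 0 and is bumped on every add, so the absolute seqID values come out one higher than in the worked example in the comment; the comparison logic is the same.

```java
import java.util.Arrays;

final class SeqIdDeletesDemo {
    static final int NOT_DELETED = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a DocumentsWriter with a searchable RAM buffer. */
    static final class Writer {
        private final int[] deletes; // deletes[docID] = seqID at which the doc was deleted
        private int seqID = 0;       // incremented on every add and delete
        private int maxDoc = 0;

        Writer(int capacity) {
            deletes = new int[capacity];
            Arrays.fill(deletes, NOT_DELETED);
        }

        int addDocument() {
            seqID++;
            return maxDoc++;
        }

        void deleteDocument(int docID) {
            if (deletes[docID] == NOT_DELETED) {
                deletes[docID] = seqID; // record when it was deleted, not just that it was
            }
            seqID++;
        }

        /** Opening a reader copies nothing: it shares the array and snapshots the seqID. */
        Reader openReader() {
            return new Reader(deletes, seqID);
        }
    }

    static final class Reader {
        private final int[] deletes;
        private final int seqID;

        Reader(int[] deletes, int seqID) {
            this.deletes = deletes;
            this.seqID = seqID;
        }

        /** A doc is deleted for this reader only if the delete happened before it was opened. */
        boolean isDeleted(int docID) {
            return deletes[docID] < seqID;
        }
    }

    public static void main(String[] args) {
        Writer dw = new Writer(8);
        dw.addDocument();
        dw.addDocument();
        dw.addDocument();
        Reader a = dw.openReader(); // opened before the delete
        dw.deleteDocument(1);
        Reader b = dw.openReader(); // opened after the delete
        System.out.println(a.isDeleted(1) + " " + b.isDeleted(1)); // prints: false true
    }
}
```

Because every reader only compares against its own snapshot, reopening costs no copy of the delete state, which is the advantage claimed over cloning a bitset.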
Re: Composing posts for both JIRA and email (was a JIRA post)
Marvin, thank you for taking the time to write up this great guidelines. Would you mind adding this to the wiki? I think this is very valuable for new devs and contributors. simon On Thu, Mar 4, 2010 at 6:28 PM, Marvin Humphrey mar...@rectangular.com wrote: (CC to lucy-dev and general, reply-to set to general) [...]
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841463#action_12841463 ] Shai Erera commented on LUCENE-2293: What about the following scenario: # A document is added w/ term A to DW1 # A document is added w/ term A to DW2 (by another thread) # A deleteDocuments(Term-A) is issued against DW1 (could be even 3, where A does not exist) I thought that when (3) happens, the delete-by-term needs to be issued against all DWs, so that later when they apply their deletes they'll *remember* to do so. Issuing that against all DWs will record the docID of each DW up until which the delete should apply. We could move to doing the delete right-away, by reopening a DW reader, and we could move to storing deletes in int[] rather than bit set. But I'm not sure I understand how your proposal will handle the scenario I've described. Also, I don't see the advantage of moving to store the deletes in int[] rather than bitset ... is it just to avoid calling the get(doc)?
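Shai's scenario, where a delete-by-term has to reach every DW and each DW records the docID up to which the delete applies, might look roughly like this. It is a sketch under simplifying assumptions: `DocWriter`, `bufferDelete` and `applyDeletes` are hypothetical names, and each document carries a single term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BroadcastDeletesDemo {
    /** Hypothetical stand-in for one DW with private docIDs and its own buffered deletes. */
    static final class DocWriter {
        private final List<String> docTerms = new ArrayList<>();              // one term per doc, for brevity
        private final Map<String, Integer> bufferedDeletes = new HashMap<>(); // term -> docID upto

        int addDocument(String term) {
            docTerms.add(term);
            return docTerms.size() - 1;
        }

        /** Remembers that docs with this term and docID < current doc count must go. */
        void bufferDelete(String term) {
            bufferedDeletes.put(term, docTerms.size());
        }

        /** Resolves buffered deletes to docIDs, for example at flush time. */
        List<Integer> applyDeletes() {
            List<Integer> deleted = new ArrayList<>();
            for (int docID = 0; docID < docTerms.size(); docID++) {
                Integer upto = bufferedDeletes.get(docTerms.get(docID));
                if (upto != null && docID < upto) {
                    deleted.add(docID);
                }
            }
            return deleted;
        }
    }

    /** The delete is issued against all writers, not only the one the caller happens to use. */
    static void deleteDocuments(List<DocWriter> writers, String term) {
        for (DocWriter w : writers) {
            w.bufferDelete(term);
        }
    }

    public static void main(String[] args) {
        DocWriter dw1 = new DocWriter();
        DocWriter dw2 = new DocWriter();
        dw1.addDocument("A");                              // step 1
        dw2.addDocument("A");                              // step 2, another thread
        deleteDocuments(Arrays.asList(dw1, dw2), "A");     // step 3
        dw2.addDocument("A");                              // added after the delete, must survive
        System.out.println(dw1.applyDeletes() + " " + dw2.applyDeletes());
    }
}
```

Recording the docID upto per writer is what keeps a document added after the delete alive, even though it contains the same term.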
RE: Request for clarification on unordered SpanNearQuery
Paul (and Mark),

Thank you for answering. Do you suppose "not really straightforward" means 40 hours or something like that? I'm just trying to get an idea of whether what I'm attempting is worth the effort.

Mike

-Original Message-
From: java-dev-return-47351-michael.j.goddard=saic@lucene.apache.org on behalf of Paul Elschot
Sent: Thu 3/4/2010 11:51 AM
To: java-dev@lucene.apache.org
Subject: Re: Request for clarification on unordered SpanNearQuery

Michael,

The test for the 4th range fails because the first matching subspans (for t1 in this case) is always the one that is advanced first, and the first match at that point has less slop (0) than the maximum allowed (1), so one might actually try to advance another subspans first. But that is not really straightforward to implement, especially when different terms can be indexed in the same position.

Perhaps the javadocs for the unordered case should be improved to mention that in the unordered case the first subspans is always the one that is advanced first.

Regards,
Paul Elschot

On Thursday 04 March 2010 17:34:26, Goddard, Michael J. wrote:

I've been working on some highlighting changes involving Spans (https://issues.apache.org/jira/browse/LUCENE-2287) and could use some help understanding when overlapping Spans are valid. To illustrate, I added the test below to the TestSpans class; this test fails because there is no fourth range. Am I wrong in my expectation that that last range would match?

Thanks.

Mike

    // Doc 11 contains "t1 t2 t1 t3 t2 t3"
    public void testSpanNearUnOrderedOverlap() throws Exception {
      boolean ordered = false;
      int slop = 1;
      SpanNearQuery snq = new SpanNearQuery(
          new SpanQuery[] {
            makeSpanTermQuery("t1"),
            makeSpanTermQuery("t2"),
            makeSpanTermQuery("t3") },
          slop,
          ordered);
      Spans spans = snq.getSpans(searcher.getIndexReader());

      assertTrue("first range", spans.next());
      assertEquals("first doc", 11, spans.doc());
      assertEquals("first start", 0, spans.start());
      assertEquals("first end", 4, spans.end());

      assertTrue("second range", spans.next());
      assertEquals("second doc", 11, spans.doc());
      assertEquals("second start", 1, spans.start());
      assertEquals("second end", 4, spans.end());

      assertTrue("third range", spans.next());
      assertEquals("third doc", 11, spans.doc());
      assertEquals("third start", 2, spans.start());
      assertEquals("third end", 5, spans.end());

      // Question: why wouldn't this Span be found?
      assertTrue("fourth range", spans.next());
      assertEquals("fourth doc", 11, spans.doc());
      assertEquals("fourth start", 2, spans.start());
      assertEquals("fourth end", 6, spans.end());

      assertFalse("fifth range", spans.next());
    }
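Paul's explanation can be checked with a small simulation that has no Lucene dependency. It is a simplified model of the behaviour he describes (always advance the sub-span with the smallest start), not the actual NearSpansUnordered code; `match` and its parameters are names invented for this sketch, and every term is assumed to occupy exactly one position.

```java
import java.util.ArrayList;
import java.util.List;

final class UnorderedNearDemo {
    /**
     * positions[i] holds the sorted, non-empty positions of term i in one document.
     * Returns the matching [start, end) ranges in the order they are found.
     */
    static List<int[]> match(int[][] positions, int slop) {
        int n = positions.length;
        int[] idx = new int[n]; // current position index per term
        List<int[]> out = new ArrayList<>();
        while (true) {
            int min = 0;    // term whose current position starts first
            int maxEnd = 0; // largest end over all current positions
            for (int i = 0; i < n; i++) {
                int start = positions[i][idx[i]];
                if (start < positions[min][idx[min]]) {
                    min = i;
                }
                maxEnd = Math.max(maxEnd, start + 1);
            }
            int minStart = positions[min][idx[min]];
            // each term has length 1, so the terms themselves cover n positions
            if (maxEnd - minStart - n <= slop) {
                out.add(new int[] {minStart, maxEnd});
            }
            // the first sub-span is always the one advanced, match or no match
            if (++idx[min] == positions[min].length) {
                return out;
            }
        }
    }

    public static void main(String[] args) {
        // "t1 t2 t1 t3 t2 t3": t1 at 0 and 2, t2 at 1 and 4, t3 at 3 and 5
        int[][] positions = {{0, 2}, {1, 4}, {3, 5}};
        for (int[] range : match(positions, 1)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

On the document from the test this yields 0-4, 1-4 and 2-5. After the third match t1 is the first sub-span and has no further positions, so the combination t1@2, t2@4, t3@5 (range 2-6, slop 1) is never examined, which matches the missing fourth range.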
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841545#action_12841545 ]

Michael Busch commented on LUCENE-2293:
---------------------------------------

{quote}
I thought that when (3) happens, the delete-by-term needs to be issued against all DWs, so that later when they apply their deletes they'll remember to do so. Issuing that against all DWs will record the docID of each DW up until which the delete should apply.
{quote}

Yes, you still need to apply deletes on all DWs. My approach is no different in that regard.

{quote}
Also, I don't see the advantage of moving to store the deletes in int[] rather than bitset ... is it just to avoid calling the get(doc)?
{quote}

The big advantage is that all (re)opened readers can share the single int[] array. If you use a bitset you need to clone it for each reader. With the int[], reopening becomes basically free from a deletes perspective.
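A minimal sketch of the sharing argument, under assumptions the comment does not spell out: here each slot of the int[] holds the sequence number at which that doc was deleted (or a LIVE sentinel), and each reader only remembers how many deletes existed when it was opened. The names SharedDeletes, ReaderView, deletedAt and visibleSeq are hypothetical and are not Lucene classes or fields. The sketch also ignores cross-thread visibility of the plain int[] writes.

```java
import java.util.Arrays;

/** Hypothetical sketch: one deletes array shared by every reader of a segment. */
final class SharedDeletes {
    static final int LIVE = Integer.MAX_VALUE;

    // deletedAt[doc] = sequence number of the delete, or LIVE if never deleted.
    private final int[] deletedAt;
    private int nextSeq = 0;

    SharedDeletes(int maxDoc) {
        deletedAt = new int[maxDoc];
        Arrays.fill(deletedAt, LIVE);
    }

    /** Writer side: stamp the doc with the next sequence number (first delete wins). */
    synchronized void delete(int doc) {
        if (deletedAt[doc] == LIVE) {
            deletedAt[doc] = nextSeq++;
        }
    }

    /** Reopen: nothing is cloned, the reader just records the current sequence bound. */
    synchronized ReaderView openReader() {
        return new ReaderView(deletedAt, nextSeq);
    }
}

/** Hypothetical point-in-time view over the shared array. */
final class ReaderView {
    private final int[] deletedAt;   // same array instance for all readers
    private final int visibleSeq;    // deletes with a smaller sequence number are visible

    ReaderView(int[] deletedAt, int visibleSeq) {
        this.deletedAt = deletedAt;
        this.visibleSeq = visibleSeq;
    }

    boolean isDeleted(int doc) {
        return deletedAt[doc] < visibleSeq;
    }
}
```

With a BitSet, each reopened reader would need its own clone to keep a stable point-in-time view; in this sketch a reopen costs one small object, whatever the size of the segment.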
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841574#action_12841574 ]

Earwin Burrfoot commented on LUCENE-2294:
-----------------------------------------

I voted for killing these delegating methods some time ago. It ended in nothing, so I vote again, #3 :)

Create IndexWriterConfiguration and store all of IW configuration there
-----------------------------------------------------------------------

Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Fix For: 3.1

I would like to factor all IW configuration parameters out into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration).

I was thinking of storing the following parameters there:
* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example, I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything through setter/getter methods and default to whatever IW defaults to today. The exception is Analyzer, which will need to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that anyone who chooses to move to the new API is aware of it ...

I plan to deprecate all the ctors and getters/setters and replace them with:
* One ctor as described above.
* getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API.

I'll start to work on a patch.
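The chained-setter idea from the issue description could look roughly like the sketch below. It is only an illustration of the pattern: the class SketchWriterConfig, its two settings, the String stand-in for an Analyzer, and the default values are all assumptions made for this example, not the API that the patch ended up with.

```java
/** Hypothetical config holder illustrating setters that return the config for chaining. */
final class SketchWriterConfig {
    static final int UNLIMITED = Integer.MAX_VALUE;

    // Fixed at construction time, no setter (mirrors the Analyzer rule in the proposal).
    private final String analyzerName;

    // Assumed defaults, chosen only for this sketch.
    private int maxFieldLength = UNLIMITED;
    private double ramBufferSizeMB = 16.0;

    SketchWriterConfig(String analyzerName) {
        if (analyzerName == null) {
            throw new IllegalArgumentException("analyzerName must not be null");
        }
        this.analyzerName = analyzerName;
    }

    String getAnalyzerName() { return analyzerName; }
    int getMaxFieldLength() { return maxFieldLength; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }

    /** Returns this so that calls can be chained. */
    SketchWriterConfig setMaxFieldLength(int maxFieldLength) {
        if (maxFieldLength <= 0) {
            throw new IllegalArgumentException("maxFieldLength must be positive");
        }
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    /** Returns this so that calls can be chained. */
    SketchWriterConfig setRamBufferSizeMB(double ramBufferSizeMB) {
        if (ramBufferSizeMB <= 0.0) {
            throw new IllegalArgumentException("ramBufferSizeMB must be positive");
        }
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }
}
```

A caller would then write something like new SketchWriterConfig("standard").setMaxFieldLength(10000).setRamBufferSizeMB(32.0) and hand the result to the single remaining constructor.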
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841595#action_12841595 ]

Yonik Seeley commented on LUCENE-2294:
--------------------------------------

Yay, we'll be able to remove SolrIndexConfig and use this :-)
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841617#action_12841617 ]

Michael Busch commented on LUCENE-2293:
---------------------------------------

bq. The big advantage is that all (re)opened readers can share the single int[] array.

Dirty reads will be a problem when sharing the array. An AtomicIntegerArray could be used instead. We need to experiment to see how expensive that would be.
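To make the AtomicIntegerArray suggestion concrete, here is a small sketch using java.util.concurrent.atomic.AtomicIntegerArray, whose get and set have volatile semantics, so a reader thread cannot observe a stale slot after the writer has published it. The encoding is an assumption for illustration only (each slot holds the sequence number of the delete, or a LIVE sentinel), and the names AtomicDeletes, snapshot and isDeleted are hypothetical. The cost question raised in the comment is not answered here.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical sketch: shared deletes with lock-free, visibility-safe reads. */
final class AtomicDeletes {
    static final int LIVE = Integer.MAX_VALUE;

    // deletedAt.get(doc) = sequence number of the delete, or LIVE if never deleted.
    private final AtomicIntegerArray deletedAt;
    private int nextSeq = 0; // guarded by "this"

    AtomicDeletes(int maxDoc) {
        deletedAt = new AtomicIntegerArray(maxDoc);
        for (int i = 0; i < maxDoc; i++) {
            deletedAt.set(i, LIVE);
        }
    }

    /** Writer side, serialized by the lock. Returns false if the doc was already deleted. */
    synchronized boolean delete(int doc) {
        if (deletedAt.get(doc) != LIVE) {
            return false;
        }
        deletedAt.set(doc, nextSeq++); // volatile write publishes the delete
        return true;
    }

    /** Taken when a reader is (re)opened: the number of deletes it is allowed to see. */
    synchronized int snapshot() {
        return nextSeq;
    }

    /** Reader side: a volatile read, no lock and no per-reader copy. */
    boolean isDeleted(int doc, int snapshot) {
        return deletedAt.get(doc) < snapshot;
    }
}
```

The design choice in this sketch is to keep writes behind a lock so that sequence numbers are assigned and published in order, while reads stay lock-free. Every read pays for a volatile access, which is the overhead that would need measuring.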
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841694#action_12841694 ]

Shai Erera commented on LUCENE-2294:
------------------------------------

Ok, then I'll proceed w/ #3.