[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841120#action_12841120
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
Also, in the pull approach, Lucene would introduce another place where it 
allocates threads.
{quote}

What I described is not much different from what's happening today. 
DocumentsWriter already has a WaitQueue that ensures that the docs are written 
in the right order.

I simply tried to suggest a way to refactor our classes... functionally the 
same as what Mike suggested. I shouldn't have said "pulled from" (the queue).

 IndexWriter has hard limit on max concurrency
 -

 Key: LUCENE-2293
 URL: https://issues.apache.org/jira/browse/LUCENE-2293
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1


 DocumentsWriter has this nasty hardwired constant:
 {code}
 private final static int MAX_THREAD_STATE = 5;
 {code}
 to which I probably should have attached a //nocommit the moment I
 wrote it ;)
 That constant sets the max number of thread states to 5.  This means,
 if more than 5 threads enter IndexWriter at once, they will share
 only 5 thread states, meaning we gate CPU concurrency to 5 running
 threads inside IW (each thread must first wait for the last thread to
 finish using the thread state before grabbing it).
 This is bad because modern hardware can make use of more than 5
 threads.  So I think an immediate fix is to make this settable
 (expert), and increase the default (8?).
 It's tricky, though, because the more thread states, the less RAM
 efficiency you have, meaning the worse indexing throughput.  So you
 shouldn't up and set this to 50: you'll be flushing too often.
 But... I think a better fix is to re-think how threads write state
 into DocumentsWriter.  Today, a single docID stream is assigned across
 threads (eg one thread gets docID=0, next one docID=1, etc.), and each
 thread writes to a private RAM buffer (living in the thread state),
 and then on flush we do a merge sort.  The merge sort is inefficient
 (does not currently use a PQ)... and, wasteful because we must
 re-decode every posting byte.
 I think we could change this, so that threads write to private RAM
 buffers, with a private docID stream, but then instead of merging on
 flush, we directly flush each thread as its own segment (and, allocate
 private docIDs to each thread).  We can then leave merging to CMS
 which can already run merges in the BG without blocking ongoing
 indexing (unlike the merge we do in flush, today).
 This would also allow us to separately flush thread states.  Ie, we
 need not flush all thread states at once -- we can flush one when it
 gets too big, and then let the others keep running.  This should be a
 good concurrency gain since it uses IO & CPU resources throughout
 indexing instead of the big burst of CPU only, then big burst of IO
 only, that we have today (flush today stops the world).
 One downside I can think of is... docIDs would now be less
 monotonic.  Today, if N threads are indexing, you'll roughly get
 in-time-order assignment of docIDs.  But with this change, all of one
 thread state would get 0..N docIDs, the next thread state'd get
 N+1...M docIDs, etc.  However, a single thread would still get
 monotonic assignment of docIDs.
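
To make the docID trade-off described above concrete, here is a minimal sketch. It is not Lucene code: PrivateDocIdSketch, ThreadBuffer and docBases are hypothetical names. It assumes that each thread state numbers its documents privately from 0 and that, once each buffer is flushed as its own segment, a segment's base docID is the total doc count of the segments flushed before it.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class PrivateDocIdSketch {

    /** One per-thread buffer with its own private docID counter. */
    static final class ThreadBuffer {
        private int nextLocalDocId = 0;

        /** Returns the private (segment-local) docID given to the new document. */
        int addDocument() {
            return nextLocalDocId++;
        }

        int docCount() {
            return nextLocalDocId;
        }
    }

    /** Base docID of each flushed segment: the sum of the doc counts before it. */
    static int[] docBases(List<ThreadBuffer> flushedInOrder) {
        int[] bases = new int[flushedInOrder.size()];
        int base = 0;
        for (int i = 0; i < bases.length; i++) {
            bases[i] = base;
            base += flushedInOrder.get(i).docCount();
        }
        return bases;
    }

    public static void main(String[] args) {
        ThreadBuffer a = new ThreadBuffer();
        ThreadBuffer b = new ThreadBuffer();
        // Documents arrive interleaved in time: a, b, a, b, a.
        a.addDocument();
        b.addDocument();
        a.addDocument();
        b.addDocument();
        a.addDocument();

        int[] bases = docBases(Arrays.asList(a, b));
        // All of a's documents come first, then all of b's, regardless of arrival time.
        System.out.println("segment A: docIDs " + bases[0] + ".." + (bases[0] + a.docCount() - 1));
        System.out.println("segment B: docIDs " + bases[1] + ".." + (bases[1] + b.docCount() - 1));
    }
}
```

In this sketch the five documents arrive interleaved, yet segment A ends up with docIDs 0..2 and segment B with 3..4, which is the "less monotonic" effect: order is preserved within a thread, not across threads.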

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene Filter

2010-03-04 Thread Dyutiman

Yes... and now I am trying with multiple filters. Thanks.
-- 
View this message in context: 
http://old.nabble.com/Lucene-Filter-tp27756577p27778081.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841127#action_12841127
 ] 

Shai Erera commented on LUCENE-2293:


bq. What I described is not much different from what's happening today.

Maybe I didn't understand then:
{quote}
basically a load balancer, that multiple DocumentsWriter instances would pull 
from as soon as they are done inverting the previous document?
{quote}

Who adds documents to that queue and what are the DW instances? The way I read 
it, I understood those are different threads than the application threads. If I 
misunderstood that, could you please clarify?

Also, I thought that having each thread write to a different ThreadState does 
not ensure documents are written in order, but that finally, when DW flushes, 
the different ThreadStates are merged together into one segment, which somehow 
restores the order ...

If only WaitQueue was documented :).

I obviously don't know that part of the code as well as you. So if I 
misunderstood your meaning, I'd appreciate it if you clarified it for me. What 
I would like to avoid is having Lucene allocate indexing threads on its own.

Also, is my proposal above different than what you suggest?




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841135#action_12841135
 ] 

Michael Busch commented on LUCENE-2293:
---

Sorry - after reading my comment again I can see why it was confusing. 
Loadbalancer wasn't a very good analogy.

I totally agree that Lucene should still piggyback on the application's threads 
and not start its own thread for document inversion.

Today, as you said, the DocumentsWriter manages a certain number of thread 
states, has the WaitQueue, and does its own memory management.

What I was thinking was that it would be simpler if the DocumentsWriter was 
only used by a single thread. The IndexWriter would have multiple 
DocumentsWriters and do the thread binding (+waitqueue). This would make the 
code in DocumentsWriter and the downstream classes simpler. The side-effect is 
that each DocumentsWriter would manage its own memory. 

{quote}
Also, I thought that having each thread write to a different ThreadState does 
not ensure documents are written in order, but that finally, when DW flushes, 
the different ThreadStates are merged together into one segment, which somehow 
restores the order ...
{quote}

Stored fields are written to an on-disk stream (docstore) in order. The 
WaitQueue takes care of finishing the docs in the right order. 
The postings are written into TermHashes per thread state in parallel. The doc 
IDs are in increasing order, but can have gaps. E.g. thread state 1 inverts docs 
1 and 3, thread state 2 inverts doc 2. When it's time to flush the whole buffer, 
these different TermHash postings lists get interleaved.
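
As a rough illustration of that interleaving step, the sketch below merges the per-thread-state docID lists of a single term into one increasing list. It is not the DocumentsWriter code: PostingsInterleaveSketch, Cursor and interleave are hypothetical names, and it uses a java.util.PriorityQueue for the k-way merge, which the issue description says the current flush does not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; not how DocumentsWriter is implemented. */
final class PostingsInterleaveSketch {

    /** Cursor over one thread state's increasing docID list for a single term. */
    private static final class Cursor {
        final int[] docIds;
        int pos = 0;

        Cursor(int[] docIds) {
            this.docIds = docIds;
        }

        int current() {
            return docIds[pos];
        }

        /** Moves to the next docID; returns false when the list is exhausted. */
        boolean advance() {
            return ++pos < docIds.length;
        }
    }

    /** K-way merge of per-thread-state docID lists into one increasing list. */
    static List<Integer> interleave(List<int[]> perThreadState) {
        PriorityQueue<Cursor> pq =
            new PriorityQueue<>((x, y) -> Integer.compare(x.current(), y.current()));
        for (int[] ids : perThreadState) {
            if (ids.length > 0) {
                pq.add(new Cursor(ids));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            Cursor smallest = pq.poll();
            merged.add(smallest.current());
            if (smallest.advance()) {
                pq.add(smallest); // re-insert with its next docID as the key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Thread state 1 inverted docs 1 and 3, thread state 2 inverted doc 2.
        List<int[]> perThreadState = Arrays.asList(new int[] {1, 3}, new int[] {2});
        System.out.println(interleave(perThreadState)); // prints [1, 2, 3]
    }
}
```

Each per-thread list has gaps but is increasing, so the merge only ever needs to compare the current head of each list.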




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841140#action_12841140
 ] 

Shai Erera commented on LUCENE-2293:


Ok so I think I understand now. You propose to change IW to bind a Thread to a 
DW, instead of that being done inside DW. And therefore it will simplify DW's 
code ... I wonder if that won't complicate IW code in return? Perhaps we'll 
gain a lot of simplification on DW, so a bit of complexity on IW will be ok.

If we do that ... why not rename DW to SegmentWriter? If each DW will 
eventually flush its own Segment, the name would make more sense?

BTW, I was thinking that an application can emulate this sort of thing even 
today (well ... to some extent - w/o deletes). It can create an IW for each 
indexing thread and at the end call addIndexes. What we'd need to introduce on 
IW to make it efficient though is something like addRawIndexes, which will just 
update the segments file about the new segments, but won't attempt to merge 
them and clean deletes out of them.
I think I want this API anyway for being able to add segments faster to an 
index, if e.g. you don't care about the merges at the moment ... but that is a 
separate issue.

Then I think what I proposed is more or less the same as you propose, therefore 
I'm fine with that approach. When a DW/SW realizes it exhausted its memory 
pool, it just flushes and new threads will bind to other DW/SW.

Thanks for the explanation on WaitQueue.




[jira] Created: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)
Create IndexWriterConfiguration and store all of IW configuration there
---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


I would like to factor out all IW configuration parameters into a single 
configuration class, which I propose to name IndexWriterConfiguration (or 
IndexWriterConfig). I want to store there almost everything besides the 
Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
IndexWriterConfiguration). What I was thinking of storing there are the 
following parameters:
* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example I still think 
infoStream should be set on IW directly.

I'm thinking that IWC should expose everything via setter/getter methods, and 
default to whatever IW defaults to today. Except for Analyzer, which will need 
to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a 
DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 
should be the default? Why not default to UNLIMITED and otherwise let the 
application decide what LIMITED means for it? I would like to make MFL optional 
on IWC and default to something, and I hope that default will be UNLIMITED. We 
can document that on IWC, so that if anyone chooses to move to the new API, he 
should be aware of that ...

I plan to deprecate all the ctors and getters/setters and replace them by:
* One ctor as described above
* getIndexWriterConfiguration, or simply getConfig, which can then be queried 
for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method 
which will override everything that is overridable today, except for Analyzer. 
So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so the 
above will turn into 
iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
will greatly simplify IW's API.

I'll start to work on a patch.
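
A minimal sketch of the chained-setter idea, assuming hypothetical names throughout (WriterConfigSketch, the String stand-in for Analyzer, and the two settings shown are illustrative, not the API of any eventual patch):

```java
/** Hypothetical illustration of a config object with chainable setters. */
final class WriterConfigSketch {

    /** Assumed "no limit" marker; the proposal only hopes UNLIMITED becomes the default. */
    static final int UNLIMITED = Integer.MAX_VALUE;

    // Stand-in for the Analyzer: fixed at construction, no setter.
    private final String analyzerName;

    private int maxFieldLength = UNLIMITED;
    private double ramBufferSizeMB = 16.0; // illustrative default

    WriterConfigSketch(String analyzerName) {
        if (analyzerName == null) {
            throw new IllegalArgumentException("analyzerName must not be null");
        }
        this.analyzerName = analyzerName;
    }

    /** Each setter returns this, so calls can be chained. */
    WriterConfigSketch setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    WriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() {
        return analyzerName;
    }

    int getMaxFieldLength() {
        return maxFieldLength;
    }

    double getRamBufferSizeMB() {
        return ramBufferSizeMB;
    }

    public static void main(String[] args) {
        WriterConfigSketch config = new WriterConfigSketch("standard")
            .setMaxFieldLength(5000)
            .setRamBufferSizeMB(32.0);
        System.out.println(config.getAnalyzerName() + " " + config.getMaxFieldLength()
            + " " + config.getRamBufferSizeMB());
    }
}
```

Returning this from every setter is what makes the one-line form iw.setConfig(iw.getConfig().setSomething1().setSomething2()) possible.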




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841152#action_12841152
 ] 

Uwe Schindler commented on LUCENE-2294:
---

+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailing list: 
[http://www.lucidimagination.com/search/document/d32100d8a7b67366/lucene_2_9_and_deprecated_ir_open_methods#19e1a19f4d340b8c]




[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841152#action_12841152
 ] 

Uwe Schindler edited comment on LUCENE-2294 at 3/4/10 10:00 AM:


+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailing list: 
[http://www.lucidimagination.com/search/document/19e1a19f4d340b8c/lucene_2_9_and_deprecated_ir_open_methods]

  was (Author: thetaphi):
+1 for the IndexWriterConfig with chaining method calls

We had a discussion about this a while ago on the mailinglist: 
[http://www.lucidimagination.com/search/document/d32100d8a7b67366/lucene_2_9_and_deprecated_ir_open_methods#19e1a19f4d340b8c]
  



[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841166#action_12841166
 ] 

Shai Erera commented on LUCENE-2294:


Thanks Uwe for the pointer.

I suppose that MaxFieldLength should now move to IndexWriterConfig, only it is 
public and therefore needs to be deprecated. Otherwise it will look strange 
that in order to set MFL on IWC you need to reference IW. So deprecate and 
duplicate? To IW it doesn't matter because it just takes the limit (int) from 
MFL ...




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841172#action_12841172
 ] 

Uwe Schindler commented on LUCENE-2294:
---

In my opinion the whole class is unneeded, so only deprecate it in IW but don't 
add it to IWConfig. For me a constant in IWConfig that defines UNLIMITED would 
be enough, and everything else is just an integer. +1 for defaulting to static 
final UNLIMITED=Integer.MAX_VALUE. I am not sure why this limitation is there 
at all. In my opinion it should be left to the application to limit the number 
of tokens if needed, rather than silently dropping tokens. If somebody gets an 
OOM, he can adjust the value and knows that maybe some tokens get lost.
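
As a rough sketch of the idea in this comment (hypothetical names, not the eventual Lucene API): a single UNLIMITED constant, with any other limit being a plain int, and truncation done explicitly by the application so that it knows how many tokens were dropped.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of an application-side token limit. */
final class TokenLimitSketch {

    /** The only named constant; every other limit is just an int. */
    static final int UNLIMITED = Integer.MAX_VALUE;

    /** Returns at most maxTokens tokens from the start of the list. */
    static List<String> limit(List<String> tokens, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must be >= 0");
        }
        return tokens.size() <= maxTokens ? tokens : tokens.subList(0, maxTokens);
    }

    /** Number of tokens the given limit would drop, so the loss is never silent. */
    static int dropped(List<String> tokens, int maxTokens) {
        return Math.max(0, tokens.size() - Math.max(0, maxTokens));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        System.out.println(limit(tokens, UNLIMITED) + " dropped=" + dropped(tokens, UNLIMITED));
        System.out.println(limit(tokens, 2) + " dropped=" + dropped(tokens, 2));
    }
}
```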


 Create IndexWriterConfiguration and store all of IW configuration there
 ---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I would like to factor out of all IW configuration parameters into a single 
 configuration class, which I propose to name IndexWriterConfiguration (or 
 IndexWriterConfig). I want to store there almost everything besides the 
 Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
 IndexWriterConfiguration). What I was thinking of storing there are the 
 following parameters:
 * All of ctors parameters, except for Directory.
 * The different setters where it makes sense. For example I still think 
 infoStream should be set on IW directly.
 I'm thinking that IWC should expose everything through setter/getter methods, 
 and default to whatever IW defaults to today. Except for Analyzer, which will 
 need to be defined in the ctor of IWC and won't have a setter.
 I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
 a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 
 10000 should be the default? Why not default to UNLIMITED and otherwise let 
 the application decide what LIMITED means for it? I would like to make MFL 
 optional on IWC and default to something, and I hope that default will be 
 UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
 the new API, he should be aware of that ...
 I plan to deprecate all the ctors and getters/setters and replace them by:
 * One ctor as described above
 * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
 for the setting of interest.
 * About the setters, I think maybe we can just introduce a setConfig method 
 which will override everything that is overridable today, except for 
 Analyzer. So someone could do iw.getConfig().setSomething(); 
 iw.setConfig(newConfig);
 ** The setters on IWC can return an IWC to allow chaining set calls ... so 
 the above will turn into 
 iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
 BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
 will greatly simplify IW's API.
 I'll start to work on a patch.
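
 A rough sketch of the chained-setter proposal above, assuming hypothetical names (ConfigSketch, WriterSketch, setRamBufferSizeMB, setMergeFactor) and illustrative default values. A String stands in for both the Analyzer and the Directory so the example stays self-contained; it is not the patch itself.

```java
import java.util.Objects;

// Hypothetical config object: every setter returns "this" so calls can be chained.
final class ConfigSketch {
    private final String analyzerName;      // stands in for Analyzer; ctor-only, no setter
    private double ramBufferSizeMB = 16.0;  // illustrative default
    private int mergeFactor = 10;           // illustrative default

    ConfigSketch(String analyzerName) {
        this.analyzerName = Objects.requireNonNull(analyzerName);
    }

    ConfigSketch setRamBufferSizeMB(double mb) { this.ramBufferSizeMB = mb; return this; }
    ConfigSketch setMergeFactor(int mergeFactor) { this.mergeFactor = mergeFactor; return this; }

    String getAnalyzerName() { return analyzerName; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }
    int getMergeFactor() { return mergeFactor; }
}

// Hypothetical writer with the single (Directory, config) constructor.
final class WriterSketch {
    private final String directoryPath;     // stands in for Directory
    private ConfigSketch config;

    WriterSketch(String directoryPath, ConfigSketch config) {
        this.directoryPath = Objects.requireNonNull(directoryPath);
        this.config = Objects.requireNonNull(config);
    }

    ConfigSketch getConfig() { return config; }
    void setConfig(ConfigSketch config) { this.config = Objects.requireNonNull(config); }
}
```

 Usage would then look like: w.setConfig(w.getConfig().setRamBufferSizeMB(64.0).setMergeFactor(20));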




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841173#action_12841173
 ] 

Shai Erera commented on LUCENE-2294:


Yeah I don't like it either (makes my code unnecessarily long). And I always 
use UNLIMITED, and the LIMITED=10,000 is really just a guess, and so if anyone 
wants to limit it, he needs to do new MaxFieldLength(otherLimit) which is 
unnecessarily long as well ...

I like it - I'll deprecate on IW and introduce UNLIMITED on IWC.




Fwd: (SOLR-355) Parsing mixed inclusive/exclusive range queries

2010-03-04 Thread Michael McCandless
If Solr/Lucene dev were merged, and queryParser were its own module,
this user could simply upgrade his queryParser JAR to get this fix.

Mike

-- Forwarded message --
From: Alexander S (JIRA) j...@apache.org
Date: Thu, Mar 4, 2010 at 2:24 AM
Subject: (SOLR-355)  Parsing mixed inclusive/exclusive range queries
To: solr-...@lucene.apache.org



   [ 
https://issues.apache.org/jira/browse/SOLR-355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841105#action_12841105
]

Alexander S commented on SOLR-355:
--

It is fixed in Lucene, can we get it to SOLR?

 Parsing mixed inclusive/exclusive range queries
 ---

                 Key: SOLR-355
                 URL: https://issues.apache.org/jira/browse/SOLR-355
             Project: Solr
          Issue Type: Improvement
          Components: search
    Affects Versions: 1.2
            Reporter: Andrew Schurman
            Priority: Minor
         Attachments: solr-355.patch


 The current query parser doesn't handle parsing a range query (i.e. 
 ConstantScoreRangeQuery) with mixed inclusive/exclusive bounds.




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841189#action_12841189
 ] 

Shai Erera commented on LUCENE-2294:


I was wondering if perhaps instead of allowing to pass a create=true/false, we 
should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND. The current 
meaning of create is a bit unclear. I.e. if it is true, then overwrite. But if 
it is false, don't attempt to create, but just open an existing one. However if 
the directory is empty, it throws an exception. I think an enum would allow someone 
to pass CREATE_OR_APPEND in case he doesn't know if there is an index there ... 
but I don't want to complicate things unnecessarily ... what do you think?
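
To make the three values concrete, here is a small sketch of how such an enum could resolve to "create a new index" vs. "open the existing one". The enum name OpenModeSketch and the method shouldCreate are hypothetical, and the exception type is only a placeholder.

```java
// Hypothetical sketch of the proposed three-valued open mode.
enum OpenModeSketch {
    CREATE, APPEND, CREATE_OR_APPEND;

    /**
     * Returns true if a fresh index should be written (overwriting any
     * existing one), false if the existing index should be opened.
     */
    boolean shouldCreate(boolean indexExists) {
        switch (this) {
            case CREATE:
                return true;                 // always overwrite
            case APPEND:
                if (!indexExists) {
                    throw new IllegalStateException("no index to append to");
                }
                return false;                // open what is there
            case CREATE_OR_APPEND:
                return !indexExists;         // create only when nothing exists
            default:
                throw new AssertionError(this);
        }
    }
}
```

CREATE and APPEND correspond to today's create=true and create=false; CREATE_OR_APPEND covers the case where the caller does not know whether the directory already holds an index.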




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841193#action_12841193
 ] 

Earwin Burrfoot commented on LUCENE-2293:
-

bq. I wonder if that won't complicate IW code in return? Perhaps we'll gain a 
lot of simplification on DW, so a bit of complexity on IW will be ok.
That will get rid of all that *PerThread insanity for each DW component, if I'm 
getting it right. That's -13 classes. Yay for the issue!

On a random sidenote, can we group things like these into subpackages? Having 
132 files in oal.index is somewhat intimidating when trying to read/understand 
things.

 IndexWriter has hard limit on max concurrency
 -

 Key: LUCENE-2293
 URL: https://issues.apache.org/jira/browse/LUCENE-2293
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1


 DocumentsWriter has this nasty hardwired constant:
 {code}
 private final static int MAX_THREAD_STATE = 5;
 {code}
 which probably I should have attached a //nocommit to the moment I
 wrote it ;)
 That constant sets the max number of thread states to 5.  This means,
 if more than 5 threads enter IndexWriter at once, they will share
 only 5 thread states, meaning we gate CPU concurrency to 5 running
 threads inside IW (each thread must first wait for the last thread to
 finish using the thread state before grabbing it).
 This is bad because modern hardware can make use of more than 5
 threads.  So I think an immediate fix is to make this settable
 (expert), and increase the default (8?).
 It's tricky, though, because the more thread states, the less RAM
 efficiency you have, meaning the worse indexing throughput.  So you
 shouldn't up and set this to 50: you'll be flushing too often.
 But... I think a better fix is to re-think how threads write state
 into DocumentsWriter.  Today, a single docID stream is assigned across
 threads (eg one thread gets docID=0, next one docID=1, etc.), and each
 thread writes to a private RAM buffer (living in the thread state),
 and then on flush we do a merge sort.  The merge sort is inefficient
 (does not currently use a PQ)... and, wasteful because we must
 re-decode every posting byte.
 I think we could change this, so that threads write to private RAM
 buffers, with a private docID stream, but then instead of merging on
 flush, we directly flush each thread as its own segment (and, allocate
 private docIDs to each thread).  We can then leave merging to CMS
 which can already run merges in the BG without blocking ongoing
 indexing (unlike the merge we do in flush, today).
 This would also allow us to separately flush thread states.  Ie, we
 need not flush all thread states at once -- we can flush one when it
 gets too big, and then let the others keep running.  This should be a
 good concurrency gain since it uses IO & CPU resources throughout
 indexing instead of the big burst of CPU only, then big burst of IO
 only, that we have today (flush today stops the world).
 One downside I can think of is... docIDs would now be less
 monotonic.  Today, if N threads are indexing, you'll roughly get
 in-time-order assignment of docIDs.  But with this change, all of one
 thread state would get 0..N docIDs, the next thread state'd get
 N+1...M docIDs, etc.  However, a single thread would still get
 monotonic assignment of docIDs.
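
 The docID consequence described above can be illustrated with a small sketch. All names (PrivateSegmentSketch, FlushedIndexSketch) are hypothetical, and a document is reduced to a String; this only shows how private docIDs would map onto contiguous ranges once each buffer is flushed as its own segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-thread buffer with its own private docID stream.
final class PrivateSegmentSketch {
    private final List<String> docs = new ArrayList<>();

    /** Adds a document and returns its segment-private docID (0, 1, 2, ...). */
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    int size() {
        return docs.size();
    }
}

// Hypothetical index that accepts each flushed buffer as a separate segment.
final class FlushedIndexSketch {
    private int nextBase = 0;

    /**
     * Flushes one buffer on its own, without merging it with the others.
     * Returns the docID base of the new segment; a document's index-wide
     * docID is base + its private docID.
     */
    synchronized int flush(PrivateSegmentSketch segment) {
        int base = nextBase;
        nextBase += segment.size();
        return base;
    }
}
```

 If thread state A holds 3 docs and thread state B holds 2, flushing A then B gives A the range 0..2 and B the range 3..4, regardless of the order in which the documents originally arrived.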




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841199#action_12841199
 ] 

Shai Erera commented on LUCENE-2294:


IndexingChain is one of the things that can be set on IW, however I don't see 
any implementations of it besides the default, and the class itself is 
package-private, so no app could actually set it on IW (unless it puts its code 
under o.a.l.index). Therefore I'm thinking of not introducing it on IWC; or 
should we turn it into a public class?
Is it really something we expect any application out there to set, or can we 
simply let DocsWriter implement one for itself internally, and not declare this 
class as abstract etc.?




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841207#action_12841207
 ] 

Shai Erera commented on LUCENE-2294:


I'm thinking of making this whole IWC a constructor-only parameter to IW, without 
the ability to set it afterwards. I don't see any reason why anyone would 
change the RAM limit, Similarity etc. while IW is running. What's the advantage 
vs. say closing the current IW and opening a new one with the different settings? I 
know the latter is more expensive, and I say so deliberately - I think those 
settings are really ctor-only settings. Otherwise you might get inconsistent 
documents (like changing the Similarity or max field length).

This will also simplify IWC, because otherwise I need to distinguish the settings 
that cannot be altered afterwards, like IndexDeletionPolicy, create, 
IndexCommit, Analyzer ... if IWC is a ctor-only object, I can have only 
the default ctor (to init to default settings) and provide setters for 
everything else.

Any objections?
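
One way to picture the ctor-only variant: the writer copies the settings once at construction, so later changes to the config object have no effect on a running writer. CtorOnlyConfigSketch, CtorOnlyWriterSketch and the 16.0 default are hypothetical names and values chosen for illustration.

```java
// Hypothetical config with a default ctor and chained setters.
final class CtorOnlyConfigSketch {
    private double ramBufferSizeMB = 16.0;  // illustrative default

    CtorOnlyConfigSketch setRamBufferSizeMB(double mb) {
        this.ramBufferSizeMB = mb;
        return this;
    }

    double getRamBufferSizeMB() {
        return ramBufferSizeMB;
    }
}

// Hypothetical writer that snapshots its settings and exposes no setConfig.
final class CtorOnlyWriterSketch {
    private final double ramBufferSizeMB;   // copied once, never changed

    CtorOnlyWriterSketch(CtorOnlyConfigSketch config) {
        this.ramBufferSizeMB = config.getRamBufferSizeMB();
    }

    double getRamBufferSizeMB() {
        return ramBufferSizeMB;
    }
}
```

Changing a setting then means closing the writer and opening a new one with a different config, which is the trade-off raised above.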




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841225#action_12841225
 ] 

Michael McCandless commented on LUCENE-2293:


I agree IW should not spawn its own threads.  It should piggy back on
incoming threads.

On whether we can remove the perThread layer throughout the chain --
that would be compelling.  But, we should scrutinize what that layer
does throughout the current chain to assess what we might lose.

But, I was proposing a bigger change (call it private RAM segments):
there would be multiple DWs, each one writing to its own private RAM
segment (each one getting private docID assignment) *and* its own doc
stores.

There would be no more WaitQueue in IW.

Each DW would flush its own segment privately.  They would not all
flush at once (merging their postings) like we must do today because
they share a single docID space.

As I understand it, this would be a step towards how Lucy handles
concurrency during indexing.  Ie, it'd make the DWs nearly fully
independent from one another, and then IW is just there to dispatch/do
merging/etc.  (In Lucy each writer is a separate process, I think --
VERY independent).

We could do both changes, too (remove the perThread layer of the
indexing chain and switch to private RAM segments) -- I think they
are actually orthogonal.

bq. The other downside is that you would have to buffer deleted docs and 
queries separately for each thread state, because you have to keep the private 
docID? So that would need a bit more memory.

Right.

bq. Mike, good one! Would having a doc id stream per thread make implementing a 
searchable RAM buffer easier?

Yes -- they would just appear like sub segments.

bq. I hope we won't lose monotonic docIDs for a singlethreaded indexation 
somewhere along that path.

We won't.

{quote}
Instead, I prefer to take advantage of the application's concurrency level in 
the following way:

* Each thread will continue to write documents to a ThreadState. We'll allow 
changing the MAX_LEVEL, so if an app wants to get more concurrency, it can.
  - MAX_LEVEL will set the number of ThreadState objects available.
* All threads will obtain memory buffers from a pool which will be limited by 
IW's RAM limit.
* When a thread finishes indexing a document and realizes the pool has been 
exhausted, it flushes its ThreadState.
  - At that moment, that ThreadState is pulled out of the 'active' list and is 
flushed. When it's done, it reclaims its used buffers and is put back in 
the active list.
  - New threads that come in will simply pick a ThreadState from the pool (but 
we'll bind them to that instance until it's flushed) and add documents to them.
  - That way, we hijack an application thread to do the flushing, which is 
anyway what happens today.
{quote}

+1 -- this I think matches what I was thinking.
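
The quoted scheme can be sketched roughly as follows, under the assumption that RAM is tracked as a simple byte counter. RamPoolSketch and ThreadStateSketch are hypothetical names; the sketch leaves out the active list and the binding of threads to states, and only shows the calling thread doing the flush once the shared pool is exhausted.

```java
// Hypothetical shared pool bounded by the writer's RAM limit.
final class RamPoolSketch {
    private final long limitBytes;
    private long usedBytes;

    RamPoolSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves bytes and reports whether the pool is now exhausted. */
    synchronized boolean reserve(long bytes) {
        usedBytes += bytes;
        return usedBytes >= limitBytes;
    }

    synchronized void release(long bytes) {
        usedBytes -= bytes;
    }

    synchronized long used() {
        return usedBytes;
    }
}

// Hypothetical thread state; assumed to be used by one thread at a time.
final class ThreadStateSketch {
    private final RamPoolSketch pool;
    private long bufferedBytes;
    private int flushCount;

    ThreadStateSketch(RamPoolSketch pool) {
        this.pool = pool;
    }

    /** Buffers one document; the calling thread flushes if the pool ran out. */
    void addDocument(long docBytes) {
        bufferedBytes += docBytes;
        if (pool.reserve(docBytes)) {
            flush();
        }
    }

    /** Flushes only this state and returns its buffers to the pool. */
    private void flush() {
        pool.release(bufferedBytes);
        bufferedBytes = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }
}
```

Note that only the state that hits the limit flushes; the other states keep their buffered documents and keep running.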

bq. If only WaitQueue was documented

Sorry :(

But WaitQueue would go away with this change.  We would no longer have
shared doc stores!



Vote on merging dev of Lucene and Solr

2010-03-04 Thread Mark Miller
For those committers that don't follow the general mailing list, or 
follow it that closely, we are currently having a vote for committers:


http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development

--
- Mark

http://www.lucidimagination.com





[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841232#action_12841232
 ] 

Michael McCandless commented on LUCENE-2294:


+1 -- this is great!

bq. I am not sure why MaxFieldLength is required in all IW ctors, yet IW 
declares a DEFAULT (which is an int and not MaxFieldLength). 

This is because it's a dangerous setting (you silently lose content
while indexing), a trap.  So we want to force the user to make the
choice, up front, so they realize the implications.

But, if we change the default to UNLIMITED (which we should do under
Version), then I agree you should not have to specify it.

bq. In my opinion it should be left to the application to limit the number of 
tokens if needed, but not silently drop tokens

I like that approach -- we could make a TokenFilter to do this?  Then
we don't need MFL at all in IWC (and deprecate in IW).
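
A sketch of what such a limiting filter does, written against a plain Iterator of Strings rather than Lucene's TokenStream API so it stays self-contained. LimitTokensSketch is a hypothetical name; a real TokenFilter would look different.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a token-limiting filter: passes through at most
// maxTokens tokens from the wrapped stream and then reports exhaustion.
final class LimitTokensSketch implements Iterator<String> {
    private final Iterator<String> in;
    private final int maxTokens;
    private int emitted;

    LimitTokensSketch(Iterator<String> in, int maxTokens) {
        this.in = in;
        this.maxTokens = maxTokens;
    }

    @Override
    public boolean hasNext() {
        return emitted < maxTokens && in.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        emitted++;
        return in.next();
    }
}
```

The application then opts into truncation explicitly by wrapping its own analysis chain, instead of the writer dropping tokens behind its back.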

bq. I was wondering if perhaps instead of allowing to pass a create=true/false, 
we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND

+1

bq. I'm thinking to make this whole IWC a constructor only parameter to IW, 
without the ability to set it afterwards.

+1 in general, though we should go setting by setting to confirm this is OK.  I
don't know of real use cases where apps eg want to change RAM buffer
or mergeFactor... but maybe there are some interesting usages out
there.





[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841235#action_12841235
 ] 

Shai Erera commented on LUCENE-2293:


Perhaps instead of buffering the delete Terms/Queries somewhere central, when a 
delete by term is performed by a certain DW, it can register it immediately on 
all existing DWs. Each DW will record the doc ID up until which this term 
delete should be executed, and when it's time for it to flush, it will apply all 
the deletes that were accumulated on itself. It'll be like doing parallel segment 
deletes (but maybe I'm too into Parallel Indexing :)).

This should not affect any documents that were added to any DW after the delete 
happened, and if we simply do it (synced) across all active DWs, I think we 
should be fine?
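
A rough sketch of that idea, with hypothetical names (DwSketch, DeleteBroadcastSketch) and each document reduced to a single term for brevity. Each DW records the private docID up to which a term delete applies and resolves it at flush time. The broadcast loop below locks each DW in turn and is not atomic across all of them, so real code would need the stronger "synced" step mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-thread DocumentsWriter with private docIDs.
final class DwSketch {
    private final List<String> docTerms = new ArrayList<>();
    // term -> private docID limit; docs below the limit with that term are deleted
    private final Map<String, Integer> deleteUpto = new LinkedHashMap<>();

    synchronized void addDocument(String term) {
        docTerms.add(term);
    }

    /** Records that the delete covers every doc buffered so far. */
    synchronized void registerDelete(String term) {
        deleteUpto.put(term, docTerms.size());
    }

    /** Applies the buffered deletes and returns the surviving private docIDs. */
    synchronized List<Integer> flush() {
        List<Integer> live = new ArrayList<>();
        for (int docID = 0; docID < docTerms.size(); docID++) {
            Integer upto = deleteUpto.get(docTerms.get(docID));
            if (upto == null || docID >= upto) {
                live.add(docID);
            }
        }
        return live;
    }
}

final class DeleteBroadcastSketch {
    /** Registers a delete-by-term on every active DW. */
    static void deleteByTerm(List<DwSketch> writers, String term) {
        for (DwSketch dw : writers) {
            dw.registerDelete(term);
        }
    }
}
```

Documents added after the delete keep a private docID at or above the recorded limit, so they survive the flush.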

 IndexWriter has hard limit on max concurrency
 -

 Key: LUCENE-2293
 URL: https://issues.apache.org/jira/browse/LUCENE-2293
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1


 DocumentsWriter has this nasty hardwired constant:
 {code}
 private final static int MAX_THREAD_STATE = 5;
 {code}
 which probably I should have attached a //nocommit to the moment I
 wrote it ;)
 That constant sets the max number of thread states to 5.  This means,
 if more than 5 threads enter IndexWriter at once, they will share
 only 5 thread states, meaning we gate CPU concurrency to 5 running
 threads inside IW (each thread must first wait for the last thread to
 finish using the thread state before grabbing it).
 This is bad because modern hardware can make use of more than 5
 threads.  So I think an immediate fix is to make this settable
 (expert), and increase the default (8?).
 It's tricky, though, because the more thread states, the less RAM
 efficiency you have, meaning the worse indexing throughput.  So you
 shouldn't up and set this to 50: you'll be flushing too often.
 But... I think a better fix is to re-think how threads write state
 into DocumentsWriter.  Today, a single docID stream is assigned across
 threads (eg one thread gets docID=0, next one docID=1, etc.), and each
 thread writes to a private RAM buffer (living in the thread state),
 and then on flush we do a merge sort.  The merge sort is inefficient
 (does not currently use a PQ)... and, wasteful because we must
 re-decode every posting byte.
 I think we could change this, so that threads write to private RAM
 buffers, with a private docID stream, but then instead of merging on
 flush, we directly flush each thread as its own segment (and, allocate
 private docIDs to each thread).  We can then leave merging to CMS
 which can already run merges in the BG without blocking ongoing
 indexing (unlike the merge we do in flush, today).
 This would also allow us to separately flush thread states.  Ie, we
 need not flush all thread states at once -- we can flush one when it
 gets too big, and then let the others keep running.  This should be a
 good concurrency gain since it uses IO & CPU resources throughout
 indexing instead of the big burst of CPU only, then big burst of IO
 only, that we have today (flush today stops the world).
 One downside I can think of is... docIDs would now be less
 monotonic, meaning if N threads are indexing, you'll roughly get
 in-time-order assignment of docIDs.  But with this change, all of one
 thread state would get 0..N docIDs, the next thread state'd get
 N+1...M docIDs, etc.  However, a single thread would still get
 monotonic assignment of docIDs.




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841239#action_12841239
 ] 

Shai Erera commented on LUCENE-2294:


bq. But, if we change the default to UNLIMITED

Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the 
new API can define whatever he wants. We'll default to UNLIMITED but there 
won't be any back-compat issue ...

bq. we could make a TokenFilter to do this?

I'm afraid that will result in changing all Analyzers to work properly? Or you 
mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this 
filter? That could work, but as soon as that becomes a filter, people may use 
it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em 
down?). Also, if I'd use such a filter myself, I wouldn't put it last in the 
chain, so that I can avoid doing any processing on a term that is not going to 
end up in the index. Although that's not too critical because I'll be doing 
this for just one term ...

I guess I'd like to keep it as it is now, not turning the issue into a bigger 
thing ... and a filter alone won't solve it - we'd still need to provide a way 
to configure it, or otherwise everyone will need to wrap their Analyzers with 
such a filter?

 Create IndexWriterConfiguration and store all of IW configuration there
 ---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I would like to factor all of IW's configuration parameters out into a single 
 configuration class, which I propose to name IndexWriterConfiguration (or 
 IndexWriterConfig). I want to store there almost everything besides the 
 Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
 IndexWriterConfiguration). What I was thinking of storing there are the 
 following parameters:
 * All of the ctors' parameters, except for Directory.
 * The different setters where it makes sense. For example I still think 
 infoStream should be set on IW directly.
 I'm thinking that IWC should expose everything via setter/getter methods, 
 and default to whatever IW defaults to today. Except for Analyzer which will 
 need to be defined in the ctor of IWC and won't have a setter.
 I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
 a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 
 10000 should be the default? Why not default to UNLIMITED and otherwise let 
 the application decide what LIMITED means for it? I would like to make MFL 
 optional on IWC and default to something, and I hope that default will be 
 UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
 the new API, he should be aware of that ...
 I plan to deprecate all the ctors and getters/setters and replace them by:
 * One ctor as described above
 * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
 for the setting of interest.
 * About the setters, I think maybe we can just introduce a setConfig method 
 which will override everything that is overridable today, except for 
 Analyzer. So someone could do iw.getConfig().setSomething(); 
 iw.setConfig(newConfig);
 ** The setters on IWC can return an IWC to allow chaining set calls ... so 
 the above will turn into 
 iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
 BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
 will greatly simplify IW's API.
 I'll start to work on a patch.




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249
 ] 

Mark Miller commented on LUCENE-2294:
-

I can see the value in this - there are a bunch of IW constructors - but 
personally I still think I prefer them.

Creating config classes to init another class is its own pain in the butt. 
Reminds me of Windows C programming and structs. When I'm just coding away, it's 
so much easier to just enter the params in the ctor. And it seems like it 
would be more difficult to know what's *required* to set on the config class - 
without the same ctor business ...




[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249
 ] 

Mark Miller edited comment on LUCENE-2294 at 3/4/10 1:45 PM:
-

I can see the value in this - there are a bunch of IW constructors - but 
personally I still think I prefer them.

Creating config classes to init another class is its own pain in the butt. 
Reminds me of Windows C programming and structs. When I'm just coding away, it's 
so much easier to just enter the params in the ctor. And it seems like it 
would be more difficult to know what's *required* to set on the config class - 
without the same ctor business ...

*edit*

Though I suppose the chaining *does* make this easier to swallow...

new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841262#action_12841262
 ] 

Shai Erera commented on LUCENE-2294:


I wouldn't worry about what's required - Directory will be left out, MFL is 
useless and a pain anyway, so what's left is Analyzer. I can put Analyzer on 
IWC's ctor, but I personally think we can default to a simple one (such as 
Whitespace), encouraging people to set their own. I find it very annoying 
today when I want to test something about IW and I need to pass all these 
things to IW ...

The way I see it, those who want to rely on Lucene's latest and greatest can 
just do: IndexWriter writer = new IndexWriter(dir, new IWC()); Well maybe 
except for the Analyzer, but I really don't think it matters that much. And 
like you wrote, someone can chain the setters. So win-win? If you don't care 
about anything, and just want to open a writer, index something and that's it, you 
don't need to specify anything .. otherwise you just chain calls?

One thing I should add to IWC so far (I hope to post a patch even today) is a 
Version parameter. For now it will be ignored, but as a placeholder to change 
settings in the future.
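
To make the chaining idea concrete, here is a minimal, self-contained sketch of a config class whose setters return the config itself. It is not the patch under discussion: the class name WriterConfigSketch, its fields, and the default values are all assumptions chosen for illustration, and the analyzer is represented by a plain string so the example compiles without Lucene.

```java
/** Hypothetical config holder; names and defaults are illustrative only, not Lucene's API. */
final class WriterConfigSketch {
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final String analyzerName;       // fixed at construction, deliberately no setter
    private int maxFieldLength = UNLIMITED;  // optional setting, defaults to unlimited
    private double ramBufferSizeMB = 16.0;   // assumed default for the sketch

    WriterConfigSketch(String analyzerName) {
        this.analyzerName = analyzerName;
    }

    // Each setter returns this config so that calls can be chained.
    WriterConfigSketch setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    WriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() { return analyzerName; }
    int getMaxFieldLength() { return maxFieldLength; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }
}
```

Someone who does not care about the settings would write new WriterConfigSketch("whitespace") and stop there; someone who does would append .setMaxFieldLength(10000).setRamBufferSizeMB(32.0) to the same expression.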




Re: Vote on merging dev of Lucene and Solr

2010-03-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
+1

On Thu, Mar 4, 2010 at 6:32 PM, Mark Miller markrmil...@gmail.com wrote:
 For those committers that don't follow the general mailing list, or follow
 it that closely, we are currently having a vote for committers:

 http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development

 --
 - Mark

 http://www.lucidimagination.com






-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841339#action_12841339
 ] 

Michael McCandless commented on LUCENE-2294:


bq. Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to 
the new API can define whatever he wants. We'll default to UNLIMITED but there 
won't be any back-compat issue ..

Ahh sorry right.  In the olden days, 10000 was the default.

{quote}
bq. we could make a TokenFilter to do this?

I'm afraid that will result in changing all Analyzers to work properly? Or you 
mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this 
filter? That could work, but as soon as that becomes a filter, people may use 
it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em 
down?). 
{quote}

Hmm yeah quite a hassle to fix all analyzers.  Hmmm.

bq. I guess I'd like to keep it as it is now, not turning the issue into a 
bigger thing ... and a filter alone won't solve it - we'd still need to provide 
a way to configure it, or otherwise everyone will need to wrap their Analyzers 
with such filter?

Maybe one solution is to wrap any other analyzer?  Ie, create a 
StopAfterNTokensAnalyzer,  taking another analyzer that it delegates to, and 
then sticking on this StopAfterNTokensFilter to each token stream.

But yeah maybe break this out as a separate issue...

bq. Also, if I'd use such a filter myself, I wouldn't put it last in the chain, 
so that I can avoid doing any processing on a term that is not going to end up 
in the index. Although that's not too critical because I'll be doing this for 
just one term ...

Actually it ought to be 0 terms wasted, with the filter @ the end -- with this 
StopAfterNTokensFilter, it'll immediately return false w/o asking for the 
10001th token.
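
A rough sketch of why no token is wasted, assuming a simplified stream interface rather than Lucene's real TokenStream/TokenFilter classes (SimpleTokenStream, ListTokenStream and the counter field are made up for this example). The filter checks its own count before pulling from its input, so once the limit is hit the input is never asked for token N+1.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token stream: incrementToken() returns false when exhausted. */
interface SimpleTokenStream {
    boolean incrementToken();
    String term();
}

/** Source stream that counts how many tokens were requested from it. */
final class ListTokenStream implements SimpleTokenStream {
    private final Iterator<String> it;
    private String current;
    int requested = 0;

    ListTokenStream(List<String> tokens) {
        this.it = tokens.iterator();
    }

    @Override public boolean incrementToken() {
        requested++;
        if (!it.hasNext()) return false;
        current = it.next();
        return true;
    }

    @Override public String term() { return current; }
}

/** Passes through at most maxTokens tokens, then stops without touching its input again. */
final class StopAfterNTokensFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private final int maxTokens;
    private int seen = 0;

    StopAfterNTokensFilter(SimpleTokenStream input, int maxTokens) {
        this.input = input;
        this.maxTokens = maxTokens;
    }

    @Override public boolean incrementToken() {
        if (seen >= maxTokens) return false;      // limit reached: do not pull token N+1
        if (!input.incrementToken()) return false; // input ran out before the limit
        seen++;
        return true;
    }

    @Override public String term() { return input.term(); }
}
```

In this model the wrapping analyzer would simply return new StopAfterNTokensFilter(delegateStream, n) for every stream its delegate produces.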

bq. One thing I should add to IWC so far (I hope to post a patch even today) is 
a Version parameter. For now it will be ignored, but as a placeholder to change 
settings in the future.

+1




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841342#action_12841342
 ] 

Jason Rutherglen commented on LUCENE-2293:
--

bq. But WaitQueue would go away with this change.  We would no longer have 
shared doc stores!

Cool, most of the DW code is intuitive except the shared doc stores, because 
it's hard to see when a doc store ends.  Also the interleaving is a bit 
difficult to visualize.  I look forward to checking out DW after this change.  




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841347#action_12841347
 ] 

Michael McCandless commented on LUCENE-2293:


Yes, I think each DW will have to record its own buffered delete Term/Query, 
mapping to its docID at the time the delete arrived.

Syncing across all of them would work but may be overkill.  I think we could 
instead have a lock-free collection (need not even be FIFO -- the order doesn't 
matter) into which we add all Term/Query that are deleted.  Then, any time a 
thread hits that DW to add a document, it must first service that queue, by 
popping out all Term/Query stored in it and enrolling them in the un-synchronized 
map of Term/Query -> docID.
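
The queue-then-map idea could look roughly like the sketch below. It is only an illustration under assumptions: PerThreadWriterSketch and its method names are hypothetical, delete terms are plain strings, and java.util.concurrent.ConcurrentLinkedQueue stands in for "a lock-free collection".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical per-thread writer; every name here is illustrative, not Lucene's. */
final class PerThreadWriterSketch {
    // Lock-free inbox: any thread may enqueue a delete term here.
    private final ConcurrentLinkedQueue<String> pendingDeletes = new ConcurrentLinkedQueue<>();
    // Touched only by the thread that currently owns this writer, so it needs no synchronization.
    private final Map<String, Integer> deleteUpTo = new HashMap<>();
    private int numDocs = 0;

    /** Called by any thread that issues a delete-by-term. */
    void enqueueDelete(String term) {
        pendingDeletes.add(term);
    }

    /** Called by the owning thread; services the inbox before buffering the new document. */
    int addDocument() {
        drainDeletes();
        return numDocs++; // docID assigned to the new document
    }

    /** Records, for each queued term, the docID below which the delete applies. */
    void drainDeletes() {
        String term;
        while ((term = pendingDeletes.poll()) != null) {
            deleteUpTo.put(term, numDocs);
        }
    }

    /** At flush the delete covers docIDs below the returned limit; null means nothing is buffered. */
    Integer deleteLimit(String term) {
        return deleteUpTo.get(term);
    }
}
```

Because the inbox is drained before every add, a document buffered after the delete was enqueued always gets a docID at or above the recorded limit, so it is not affected. A flush would call drainDeletes() once more before applying the map.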




Request for clarification on unordered SpanNearQuery

2010-03-04 Thread Goddard, Michael J.
I've been working on some highlighting changes involving Spans 
(https://issues.apache.org/jira/browse/LUCENE-2287) and could use some help 
understanding when overlapping Spans are valid.  To illustrate, I added the 
test below to the TestSpans class; this test fails because there is no fourth 
range.

Am I wrong in my expectation that that last range would match?

Thanks.

  Mike


  // Doc 11 contains t1 t2 t1 t3 t2 t3
  public void testSpanNearUnOrderedOverlap() throws Exception {
    boolean ordered = false;
    int slop = 1;
    SpanNearQuery snq = new SpanNearQuery(
      new SpanQuery[] {
        makeSpanTermQuery("t1"),
        makeSpanTermQuery("t2"),
        makeSpanTermQuery("t3") },
      slop,
      ordered);
    Spans spans = snq.getSpans(searcher.getIndexReader());

    assertTrue("first range", spans.next());
    assertEquals("first doc", 11, spans.doc());
    assertEquals("first start", 0, spans.start());
    assertEquals("first end", 4, spans.end());

    assertTrue("second range", spans.next());
    assertEquals("second doc", 11, spans.doc());
    assertEquals("second start", 1, spans.start());
    assertEquals("second end", 4, spans.end());

    assertTrue("third range", spans.next());
    assertEquals("third doc", 11, spans.doc());
    assertEquals("third start", 2, spans.start());
    assertEquals("third end", 5, spans.end());

    // Question: why wouldn't this Span be found?
    assertTrue("fourth range", spans.next());
    assertEquals("fourth doc", 11, spans.doc());
    assertEquals("fourth start", 2, spans.start());
    assertEquals("fourth end", 6, spans.end());

    assertFalse("fifth range", spans.next());
  }



Re: Request for clarification on unordered SpanNearQuery

2010-03-04 Thread Mark Miller

On 03/04/2010 11:34 AM, Goddard, Michael J. wrote:

>     // Question: why wouldn't this Span be found?
>     assertTrue("fourth range", spans.next());
>     assertEquals("fourth doc", 11, spans.doc());
>     assertEquals("fourth start", 2, spans.start());
>     assertEquals("fourth end", 6, spans.end());


Spans are funny beasts ;)

No Spans ever start from the same position more than once. In effect, 
they are always marching forward.


The third range starts at 2, and once it finds a match starting at 2, it
moves on. So it won't find the other match that starts at 2. Spans are
not exhaustive - exhaustive matching would be a different algorithm.


So yes, you are wrong in your expectation :) Just how Spans were 
implemented.


--
- Mark

http://www.lucidimagination.com





[jira] Resolved: (LUCENE-2283) Possible Memory Leak in StoredFieldsWriter

2010-03-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2283.


Resolution: Fixed

Thanks Tim!

 Possible Memory Leak in StoredFieldsWriter
 --

 Key: LUCENE-2283
 URL: https://issues.apache.org/jira/browse/LUCENE-2283
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Tim Smith
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2283.patch, LUCENE-2283.patch, LUCENE-2283.patch


 StoredFieldsWriter creates a pool of PerDoc instances
 this pool will grow but never be reclaimed by any mechanism
 furthermore, each PerDoc instance contains a RAMFile.
 this RAMFile will also never be truncated (and will only ever grow) (as far 
 as i can tell)
 When feeding documents with large number of stored fields (or one large 
 dominating stored field) this can result in memory being consumed in the 
 RAMFile but never reclaimed. Eventually, each pooled PerDoc could grow very 
 large, even if large documents are rare.
 Seems like there should be some attempt to reclaim memory from the PerDoc[] 
 instance pool (or otherwise limit the size of RAMFiles that are cached) etc




Re: Request for clarification on unordered SpanNearQuery

2010-03-04 Thread Paul Elschot
Michael,

The test for the 4th range fails because the first matching subspans
(for t1 in this case) is always the one that is first advanced, and the first
match at that point has less slop (0) than the maximum allowed (1),
so one might actually try and advance another subspans first.
But that is not really straightforward to implement, especially when different
terms can be indexed in the same position.

Perhaps the javadocs for the unordered case should be improved to mention
that in the unordered case the first subspans is always the one that is
advanced first.
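
The behaviour described above can be reproduced with a small
single-document simulation. This is only a sketch under simplifying
assumptions, not the Lucene implementation: every sub-span is a single
term of length 1, there is only one document, and the class and method
names are invented for the example. The sub-span with the smallest start
is always the one advanced, and the largest end seen so far is kept.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-document simulation of unordered near matching. */
final class UnorderedNearSketch {

    /**
     * positions[i] holds the sorted positions of term i in the document.
     * Returns the [start, end) pairs that are reported as matches.
     */
    static List<int[]> matches(int[][] positions, int slop) {
        List<int[]> found = new ArrayList<>();
        int n = positions.length;
        if (n == 0) return found;
        int[] cursor = new int[n];   // current index into each term's positions
        int maxEnd = 0;              // largest end seen so far; it never shrinks
        for (int[] p : positions) {
            if (p.length == 0) return found;
            maxEnd = Math.max(maxEnd, p[0] + 1);
        }
        int totalLength = n;         // each term span covers exactly one position
        while (true) {
            // Pick the sub-span with the smallest current start.
            int min = 0;
            for (int i = 1; i < n; i++) {
                if (positions[i][cursor[i]] < positions[min][cursor[min]]) min = i;
            }
            int start = positions[min][cursor[min]];
            if (maxEnd - start - totalLength <= slop) {
                found.add(new int[] {start, maxEnd});
            }
            // Always advance the first sub-span, whether or not it matched.
            cursor[min]++;
            if (cursor[min] == positions[min].length) return found;
            maxEnd = Math.max(maxEnd, positions[min][cursor[min]] + 1);
        }
    }
}
```

For the test document, t1 is at positions 0 and 2, t2 at 1 and 4, and t3
at 3 and 5. With slop 1 the simulation reports (0,4), (1,4) and (2,5).
After the third match the t1 sub-span is exhausted, so (2,6) is never
tried.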

Regards,
Paul Elschot




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841388#action_12841388
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
But, I was proposing a bigger change (call it private RAM segments):
there would be multiple DWs, each one writing to its own private RAM
segment (each one getting private docID assignment) and its own doc
stores.
{quote}

Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I 
like it.






Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-04 Thread Michael McCandless
On Tue, Mar 2, 2010 at 4:12 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
>> The problem is, these scoring models need the avg field length (in
>> tokens) across the entire index, to compute the norms.
>>
>> Ie, you can't do that on writing a single segment.
>
> I don't see why not.  We can just move everything you're doing on
> Searcher open to index time, and calculate the stats and norms
> before writing the segment out.
>
> At search time, the only segment with valid norms would be the last
> one, so we'd make sure the Searcher used those.

I see -- write norms for all segments (the full index) on each commit?
OK.

And in fact if we left it at searcher init time, you'd still
[technically] have to recompute the norms arrays across all segments
whenever one even tiny segment was added, since [technically] the
average has changed.  But I agree, once the index is large enough,
presumably the average won't change much, so...

Even in the NRT case we'd have to compute norms across the entire
index with only a small segment added.

> I think the fact that Lucy always writes one segment per indexing session --
> as opposed to Lucene's one segment per document -- makes a difference here.

Lucene isn't one segment per doc anymore -- it's one segment
per-when-RAM-buffer-filled-up.  Not sure it really makes a difference
though, since we [technically] need norms regen'd for the entire
index.

> Whether burning norms to disk at index time is the most efficient
> setup depends on the ratio of commits to searcher-opens.

Yes, and NRT opens.

> In a multi-node search cluster, pre-calculating norms at index-time
> wouldn't work well without additional communication between nodes to
> gather corpus-wide stats.  But I suspect the same trick that works
> for IDF in large corpuses would work for average field length: it
> will tend to be stable over time, so you can update it
> infrequently.

Right I imagine we'd need to use this trick within a single index,
too.  Recomputing norms for entire index when only a small new segment
was added to the new NRT reader will probably be too costly.

Though one alternative (if you don't mind burning RAM) is to skip
casting to norms, ie store the actual field length, and do the
divide-by-avg during scoring (though that's a biggish hit to search
perf).

>> So I think it must be done during searcher init.
>>
>> The most we can do is store the aggregates (eg sum of all lengths in
>> this segment) in the SegmentInfo -- this saves one pass on searcher
>> init.
>
> Logically...
>
>    token_counts: {
>        segment: {
>            title: 4,
>            content: 154,
>        },
>        all: {
>            title: 98342,
>            content: 2854213
>        }
>    }
>
> (Would that suffice?  I don't recall the gory details of BM25.)

I think so, though why store "all", per segment?  Reader can regen on
open?  (That above json comes from a single segment right?).
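
As a rough sketch of the "store the aggregates, regen on open" idea: a
reader could sum per-segment token counts and divide by the total doc
count to get the index-wide average field length, and then do the
divide-by-avg at scoring time. Everything here is hypothetical. The
class and field names are invented, the per-segment doc count is an
assumption (it is not in the JSON above), and the normalization shown is
the usual BM25-style length factor 1 - b + b * len / avgLen, not
anything Lucene does today.

```java
import java.util.List;

/** Hypothetical sketch: index-wide average field length from per-segment aggregates. */
final class FieldLengthSketch {

    /** Assumed per-segment aggregate for one field. */
    static final class SegmentStats {
        final long tokenCount; // sum of field lengths (in tokens) in this segment
        final int docCount;    // number of docs in this segment (assumption)

        SegmentStats(long tokenCount, int docCount) {
            this.tokenCount = tokenCount;
            this.docCount = docCount;
        }
    }

    /** Average field length across all segments; one cheap pass at reader open. */
    static double averageFieldLength(List<SegmentStats> segments) {
        long tokens = 0;
        long docs = 0;
        for (SegmentStats s : segments) {
            tokens += s.tokenCount;
            docs += s.docCount;
        }
        return docs == 0 ? 0.0 : (double) tokens / docs;
    }

    /** BM25-style length normalization factor; b is in [0, 1]. */
    static double lengthNorm(int fieldLength, double avgFieldLength, double b) {
        return 1.0 - b + b * (fieldLength / avgFieldLength);
    }
}
```

Only the small per-segment sums need to be recombined when a segment is
added; the per-document field lengths themselves stay untouched.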

lnu.ltc would need sum(avg(tf)) as well.

> As documents get deleted, the stats will gradually drift out of
> sync, just like doc freq does.  However, that's mitigated if you
> recycle segments that exceed a threshold deletion percentage on a
> regular basis.

Right.

>> The norms array will be stored in this per-field sim instance.
>
> Interesting, but that wasn't where I was thinking of putting them.
> Similarity objects need to be sent over the network, don't they?  At
> least they do in KS.  So I think we need a local per-field
> PostingsReader object to hold such cached data.

OK maybe not stored on them, but, accessible to them.  Maybe cached in
the SegmentReader.

Though we need every norm(docID) lookup to be fast.  Maybe we ask the
per-field Similarity to give us a scorer, that holds the right byte[]?

>>> The insane loose typing of fields in Lucene is going to make it a
>>> little tricky to implement, though.  I think you just have to
>>> exclude fields assigned to specific similarity implementations from
>>> your merge-anything-to-the-lowest-common-denominator policy and
>>> throw exceptions when there are conflicts rather than attempt to
>>> resolve them.
>>
>> Our disposition on conflict (throw exception vs silently coerce)
>> should just match what we do today, which is to always silently
>> coerce.
>
> What do you do when you have to reconcile two posting codecs like this?
>
>   * doc id, freq, position, part-of-speech identifier
>   * doc id, boost
>
> Do you silently drop all information except doc id?

I don't know -- we haven't hit that yet ;)  The closest we have is
when doc id is merged with doc id,freq,position+, and in that
case we drop the freq,position+.

With flex this'll be up to the codec's merge methods.

>>> Similarity is where we decode norms right now.  In my opinion, it
>>> should be the Similarity object from which we specify per-field
>>> posting formats.
>>
>> I agree.
>
> Great, I'm glad we're on the same page about that.

Actually [sorry] I'm no longer so sure I agree!

In flex we have a separate Codec class that's responsible 

Composing posts for both JIRA and email (was a JIRA post)

2010-03-04 Thread Marvin Humphrey
(CC to lucy-dev and general, reply-to set to general)

On Thu, Mar 04, 2010 at 06:18:28AM +, Shai Erera (JIRA) wrote:

> (Warning, this post is long, and is easier to read in JIRA)

I consume email from many of the Lucene lists, and I hate it when people force
me to read stuff via JIRA.  It slows me down to have to jump to all those
forum web pages.  I only go to the web page if there are 5 or more posts in a
row on the same issue that I need to read.

For what it's worth, I've worked out a few routines that make it possible to
compose messages which read well in both mediums.

  * Never edit your posts unless absolutely necessary.  If JIRA used diffs,
    things would be different, but instead it sends the whole frikkin' post
    twice (before and after), which makes it very difficult to see what was
    edited.  If you must edit, append an "edited:" block at the end to
    describe what you changed instead of just making changes inline.
  * Use Firefox and the "It's All Text" plugin, which makes it possible to edit
    JIRA posts using an external editor such as Vim instead of typing into a
    textarea. http://trac.gerf.org/itsalltext
  * After editing, use the preview button (it's a little monitor icon to the
    upper right of the textarea) to make sure the post looks good in JIRA.
  * Use "> " for quoting instead of JIRA's bq. and {quote} since JIRA's
    mechanisms look so crappy in email.  This is easy from Vim, because
    rewrapping a long line (by typing "gq" from visual mode to rewrap the
    current selection) that starts with "> " causes "> " to be prepended to
    the wrapped lines.
  * Use asterisk bullet lists liberally, because they look good everywhere.
  * Use asterisks for *emphasis*, because that looks good everywhere.
  * If you wrap lines, use a reasonably short line length.  (I use 78; Mike
    McCandless, who also wraps lines for his Jira posts, uses a smaller
    number).  Otherwise you'll get nasty wrapping in narrow windows, both in
    email clients and web browsers.

There are still a couple compromises that don't work out well.  For email,
ideally you want to set off code blocks with indenting:

    int foo = 1;
    int bar = 2;

To make code look decent in JIRA, you have to wrap that with {code} tags,
which unfortunately look heinous in email.  Left-justifying the tags but
indenting the code seems like it would be a rotten-but-salvageable compromise,
as it at least sets off the tags visually rather than making them appear as
though they are part of the code fragment.

{code}
    int foo = 1;
    int bar = 2;
{code}

Unfortunately, that's going to look like this in JIRA, because of a bug that
strips all leading whitespace from the first line.

   |-------------------------|
   | int foo;                |
   |     int bar;            |
   |-------------------------|

It seems that this has been fixed by Atlassian in the Confluence wiki
(http://jira.atlassian.com/browse/CONF-4548), but the issue remains for the
JIRA installation at issues.apache.org.  So for now, I manually strip
indentation until the whole block is flush left.

{code}
int foo = 1;
int bar = 2;
{code}

(Gag.  I vastly prefer wikis that automatically apply fixed-width styling to
any indented text.)

One last tip for Lucy developers (and other non-Java devs).  JIRA has limited
syntax highlighting support -- Java, JavaScript, ActionScript, XML and SQL
only -- and defaults to assuming your code is Java.  In general, you want to
override that and tell JIRA to use "none".

{code:none}
int foo = 1;
int bar = 2;
{code}

Marvin Humphrey





[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841395#action_12841395
 ] 

Michael McCandless commented on LUCENE-2293:


bq. Cool! I wasn't sure if you wanted to give them private doc stores too. +1, 
I like it.

I wasn't sure either ;)  Ie, I forgot about that aspect of my proposal until it 
was raised in the discussion... but I think that'd be necessary.

This will be a perf hit, when building up a big new index.  But since doc 
stores now merge by bulk copy (when there are no deletions) hopefully the 
impact isn't too much.  And, hopefully it's more than made up for by the 
improvement in IO/CPU interleaved concurrency.

I'll work out a patch to at least make the hardwired 5 configurable... but does 
anyone out there wanna work out the private RAM segments?
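
For the "make the hardwired 5 configurable" part, a minimal sketch of
the idea might look like the following. It is not the DocumentsWriter
code and not the eventual patch: the class, its methods and the default
of 8 (taken from the "(8?)" in the issue description) are assumptions
for illustration, and real thread states also have thread affinity,
which is ignored here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: a pool of thread states with a configurable upper bound. */
final class ThreadStatePoolSketch {

    /** Assumed default, replacing the hardwired 5. */
    static final int DEFAULT_MAX_THREAD_STATES = 8;

    /** Stand-in for the per-thread indexing state (private RAM buffer etc.). */
    static final class ThreadState {
        final int id;

        ThreadState(int id) {
            this.id = id;
        }
    }

    private final BlockingQueue<ThreadState> free;

    ThreadStatePoolSketch(int maxThreadStates) {
        if (maxThreadStates < 1) {
            throw new IllegalArgumentException("maxThreadStates must be >= 1");
        }
        free = new ArrayBlockingQueue<>(maxThreadStates);
        for (int i = 0; i < maxThreadStates; i++) {
            free.add(new ThreadState(i));
        }
    }

    /** Blocks until a thread state is free; this is where concurrency is gated. */
    ThreadState acquire() throws InterruptedException {
        return free.take();
    }

    /** Non-blocking variant; returns null when all thread states are in use. */
    ThreadState tryAcquire() {
        return free.poll();
    }

    /** Hands a thread state back so that a waiting thread can proceed. */
    void release(ThreadState state) {
        free.add(state);
    }
}
```

At most maxThreadStates threads can be indexing at once; every further
thread waits in acquire() until some other thread calls release().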




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841407#action_12841407
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. Yes, I think each DW will have to record its own buffered delete 
Term/Query, mapping to its docID at the time the delete arrived. 

I think in the future deletes in DW could work like this:
- DW of course keeps track of a private sequence id, which gets incremented in
the add, delete, update calls
- a DW has a getReader() call; the reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current
seqID at the time it was opened - let's call it RAMReader.seqID; if such a
reader gets reopened, its seqID simply gets updated.
- we keep a growing int array with the size of DW's maxDoc, which replaces the
usual deletes bitset
- when DW.updateDocument() or .deleteDocument() needs to delete a doc we do
that right away, before inverting the new doc. We can do that by running a
query using a RAMReader to find all docs that must be deleted. Instead of
flipping a bit in a bitset, for each hit we now keep track of when it was
deleted:

{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
...
Arrays.fill(deletes, NOT_DELETED);

...

public void deleteDocument(Query q) {
  reopen RAMReader
  run query q using RAMReader
  for each hit {
    int hitDocId = ...
    if (deletes[hitDocId] == NOT_DELETED) {
      deletes[hitDocId] = DW.seqID;
    }
  }
  ...
  DW.seqID++;
}
{code}

Now no matter how often you (re)open RAMReaders, they can share the deletes
array. No cloning like with the BitSet approach would be necessary.

When the RAMReader iterates posting lists it's as simple as this to treat
deleted docs correctly. Instead of doing this in RAMTermDocs.next():
{code}
  if (deletedDocsBitSet.get(doc)) {
    skip this doc
  }
{code}

we can now do:

{code}
  if (deletes[doc] < ramReader.seqID) {
    skip this doc
  }
{code}

Here is an example:
1. Add 3 docs with DW.addDocument()
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b


After 1: DW.seqID = 2; deletes[]={MAX_VALUE, MAX_VALUE, MAX_VALUE}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[]={MAX_VALUE, 2, MAX_VALUE}
After 4: ramReader_b.seqID = 3

Note that both ramReader_a and ramReader_b share the same deletes[] array. Now
when ramReader_a is used to read posting lists, it will not treat doc 1 as
deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false; But
ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) =
(2 < 3) = true.
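
A self-contained sketch of the scheme, to make the example runnable.
All names are invented. It deletes by docID instead of running a Query,
it is not thread-safe, and its sequence numbers start at 0 and are
bumped after every operation, so the concrete values differ from the
example above while the visibility rule is the same.

```java
import java.util.Arrays;

/** Hypothetical sketch: point-in-time deletes via a shared int[] of sequence ids. */
final class SeqIdDeletesSketch {

    static final int NOT_DELETED = Integer.MAX_VALUE;

    private int[] deletes = new int[4]; // grows with maxDoc
    private int maxDoc = 0;
    private int seqID = 0;

    SeqIdDeletesSketch() {
        Arrays.fill(deletes, NOT_DELETED);
    }

    /** Buffers a new doc and returns its docID. */
    int addDocument() {
        if (maxDoc == deletes.length) {
            int oldLength = deletes.length;
            deletes = Arrays.copyOf(deletes, oldLength * 2);
            Arrays.fill(deletes, oldLength, deletes.length, NOT_DELETED);
        }
        seqID++;
        return maxDoc++;
    }

    /** Records when the doc was deleted; the first delete wins. */
    void deleteDocument(int docID) {
        if (deletes[docID] == NOT_DELETED) {
            deletes[docID] = seqID;
        }
        seqID++;
    }

    /** Opens a reader that sees all operations performed so far. */
    Reader openReader() {
        return new Reader(seqID);
    }

    /** A reader only remembers its seqID; the deletes array is shared, not cloned. */
    final class Reader {
        private final int readerSeqID;

        Reader(int readerSeqID) {
            this.readerSeqID = readerSeqID;
        }

        boolean isDeleted(int docID) {
            return deletes[docID] < readerSeqID;
        }
    }
}
```

A reader opened before the delete keeps seeing the doc, a reader opened
afterwards does not, and neither of them needed its own copy of the
deletes.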

What do you think about this approach for the future when we have a searchable 
DW buffer?


Re: Composing posts for both JIRA and email (was a JIRA post)

2010-03-04 Thread Simon Willnauer
Marvin,

thank you for taking the time to write up these great guidelines. Would
you mind adding this to the wiki? I think this is very valuable for new
devs and contributors.

simon




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841463#action_12841463
 ] 

Shai Erera commented on LUCENE-2293:


What about the following scenario:
# A document is added w/ term A to DW1
# A document is added w/ term A to DW2 (by another thread)
# A deleteDocuments(Term-A) is issued against DW1 (it could even be a third
DW, where A does not exist)

I thought that when (3) happens, the delete-by-term needs to be issued against 
all DWs, so that later when they apply their deletes they'll *remember* to do 
so. Issuing that against all DWs will record the docID of each DW up until 
which the delete should apply.
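
A small sketch of what I mean, with made-up names and no claim that
this is how the DWs would really be structured: every DW buffers the
delete term together with its own current doc count, and a delete only
covers docs whose private docID is below that recorded limit.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a delete-by-term fanned out to several private DWs. */
final class BufferedDeleteSketch {

    /** Stand-in for one DW with private docIDs and its own buffered deletes. */
    static final class DocWriter {
        private int numDocs;
        private final Map<String, Integer> deleteUpto = new HashMap<>();

        /** Buffers a doc and returns its private docID. */
        int addDocument() {
            return numDocs++;
        }

        /** Remembers that the term is deleted for all docs buffered so far. */
        void bufferDelete(String term) {
            deleteUpto.put(term, numDocs);
        }

        /** True if a doc with this docID falls under the buffered delete of the term. */
        boolean deleteCovers(String term, int docID) {
            Integer upto = deleteUpto.get(term);
            return upto != null && docID < upto;
        }
    }

    /** The delete has to reach every DW, not just the one the caller happened to hit. */
    static void deleteDocuments(List<DocWriter> writers, String term) {
        for (DocWriter dw : writers) {
            dw.bufferDelete(term);
        }
    }
}
```

Whether a covered doc really contains the term still has to be checked
when the deletes are applied; the sketch only tracks the docID limit.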

We could move to doing the delete right-away, by reopening a DW reader, and we 
could move to storing deletes in int[] rather than bit set. But I'm not sure I 
understand how your proposal will handle the scenario I've described.

Also, I don't see the advantage of moving to store the deletes in int[] rather 
than bitset ... is it just to avoid calling the get(doc)?
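
To make the scenario concrete, here is a minimal sketch of the "issue the delete against all DWs" idea. Everything in it is hypothetical (SketchDW, bufferDelete, liveDocs are illustration names, not Lucene classes or methods), and each doc carries a single term for brevity. Each DW records how many docs it had buffered when the delete arrived, so a doc added afterwards with the same term survives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one per-thread DocumentsWriter (DW). */
class SketchDW {
    // One term per buffered doc; the list index is the DW-private docID.
    private final List<String> docTerms = new ArrayList<>();
    // term -> number of docs buffered when the delete arrived (exclusive docID limit).
    private final Map<String, Integer> deleteLimits = new HashMap<>();

    void addDocument(String term) {
        docTerms.add(term);
    }

    /** Remembers the delete together with the docID up to which it applies. */
    void bufferDelete(String term) {
        deleteLimits.put(term, docTerms.size());
    }

    /** Returns the private docIDs that survive once the buffered deletes are applied. */
    List<Integer> liveDocs() {
        List<Integer> live = new ArrayList<>();
        for (int docID = 0; docID < docTerms.size(); docID++) {
            Integer limit = deleteLimits.get(docTerms.get(docID));
            boolean deleted = limit != null && docID < limit;
            if (!deleted) {
                live.add(docID);
            }
        }
        return live;
    }
}

class DeleteBroadcastSketch {
    public static void main(String[] args) {
        SketchDW dw1 = new SketchDW();
        SketchDW dw2 = new SketchDW();
        dw1.addDocument("A");                 // step 1
        dw2.addDocument("A");                 // step 2, from another thread
        for (SketchDW dw : Arrays.asList(dw1, dw2)) {
            dw.bufferDelete("A");             // step 3, issued against every DW
        }
        dw2.addDocument("A");                 // arrives after the delete, must survive
        System.out.println("DW1 live docs: " + dw1.liveDocs());
        System.out.println("DW2 live docs: " + dw2.liveDocs());
    }
}
```

In this sketch the doc added to DW2 after step 3 stays live because its docID is not below the recorded limit.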

 IndexWriter has hard limit on max concurrency
 -

 Key: LUCENE-2293
 URL: https://issues.apache.org/jira/browse/LUCENE-2293
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1


 DocumentsWriter has this nasty hardwired constant:
 {code}
 private final static int MAX_THREAD_STATE = 5;
 {code}
 which probably I should have attached a //nocommit to the moment I
 wrote it ;)
 That constant sets the max number of thread states to 5.  This means,
 if more than 5 threads enter IndexWriter at once, they will share
 only 5 thread states, meaning we gate CPU concurrency to 5 running
 threads inside IW (each thread must first wait for the last thread to
 finish using the thread state before grabbing it).
 This is bad because modern hardware can make use of more than 5
 threads.  So I think an immediate fix is to make this settable
 (expert), and increase the default (8?).
 It's tricky, though, because the more thread states, the less RAM
 efficiency you have, meaning the worse indexing throughput.  So you
 shouldn't up and set this to 50: you'll be flushing too often.
 But... I think a better fix is to re-think how threads write state
 into DocumentsWriter.  Today, a single docID stream is assigned across
 threads (eg one thread gets docID=0, next one docID=1, etc.), and each
 thread writes to a private RAM buffer (living in the thread state),
 and then on flush we do a merge sort.  The merge sort is inefficient
 (does not currently use a PQ)... and, wasteful because we must
 re-decode every posting byte.
 I think we could change this, so that threads write to private RAM
 buffers, with a private docID stream, but then instead of merging on
 flush, we directly flush each thread as its own segment (and, allocate
 private docIDs to each thread).  We can then leave merging to CMS
 which can already run merges in the BG without blocking ongoing
 indexing (unlike the merge we do in flush, today).
 This would also allow us to separately flush thread states.  Ie, we
 need not flush all thread states at once -- we can flush one when it
 gets too big, and then let the others keep running.  This should be a
 good concurrency gain since it uses IO & CPU resources throughout
 indexing instead of a big burst of CPU only, then a big burst of IO
 only, that we have today (flush today stops the world).
 One downside I can think of is... docIDs would now be less
 monotonic.  Today, if N threads are indexing, you'll roughly get
 in-time-order assignment of docIDs.  But with this change, all of one
 thread state would get 0..N docIDs, the next thread state'd get
 N+1...M docIDs, etc.  However, a single thread would still get
 monotonic assignment of docIDs.
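
A rough sketch of the proposed model, under stated assumptions: SketchThreadState is a hypothetical name, a doc count stands in for the RAM-based flush trigger, and a "segment" is just a list of docs. The point is only that each thread state owns a private buffer and a private docID stream, and flushes on its own without touching the other states.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-thread state with a private buffer and a private docID stream. */
class SketchThreadState {
    private final int flushThreshold;   // assumption: doc count stands in for RAM usage
    private int nextDocID = 0;          // private docID stream, restarts with each segment
    private final List<String> buffer = new ArrayList<>();
    final List<List<String>> flushedSegments = new ArrayList<>();

    SketchThreadState(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Buffers a doc and returns its docID within the in-progress segment. */
    int addDocument(String doc) {
        int docID = nextDocID++;
        buffer.add(doc);
        if (buffer.size() >= flushThreshold) {
            flush();                    // only this state flushes; others keep indexing
        }
        return docID;
    }

    /** Writes the private buffer out as its own segment, with no merge sort. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedSegments.add(new ArrayList<>(buffer));
        buffer.clear();
        nextDocID = 0;
    }
}

class PerThreadFlushSketch {
    public static void main(String[] args) {
        SketchThreadState busy = new SketchThreadState(2);
        SketchThreadState idle = new SketchThreadState(2);
        busy.addDocument("a");
        idle.addDocument("x");
        busy.addDocument("b");          // reaches the threshold, busy flushes alone
        busy.addDocument("c");          // starts a new segment at docID 0
        System.out.println("busy segments: " + busy.flushedSegments);
        System.out.println("idle segments: " + idle.flushedSegments);
    }
}
```

Merging the resulting small segments is left to the merge scheduler in the proposal, so nothing in this sketch merges.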

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





RE: Request for clarification on unordered SpanNearQuery

2010-03-04 Thread Goddard, Michael J.
Paul (and Mark),

Thank you for answering.  Do you suppose "not really straightforward" means 40 
hours or something like that?  I'm just trying to get an idea of whether what 
I'm attempting is worth the effort.

  Mike


-Original Message-
From: java-dev-return-47351-michael.j.goddard=saic@lucene.apache.org on 
behalf of Paul Elschot
Sent: Thu 3/4/2010 11:51 AM
To: java-dev@lucene.apache.org
Subject: Re: Request for clarification on unordered SpanNearQuery
 
Michael,

The test for the 4th range fails because the first matching subspans
(for t1 in this case) is always the one that is advanced first, and the first
match at that point has less slop (0) than the maximum allowed (1),
so one might actually try to advance another subspans first.
But that is not really straightforward to implement, especially when different
terms can be indexed at the same position.

Perhaps the javadocs for the unordered case should be improved to mention
that in the unordered case the first subspans is always the one that is
advanced first.
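
For reference, a brute-force sketch (not how the unordered near spans are actually implemented) that enumerates every position combination for the three terms in doc 11. It assumes the slop test is "span width minus number of terms <= slop" and that each term occupies one position; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class UnorderedNearBruteForce {
    /** Returns {start, end} (end exclusive) for every position combination within slop. */
    static List<int[]> matches(int[][] positionsPerTerm, int slop) {
        List<int[]> out = new ArrayList<>();
        collect(positionsPerTerm, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop, out);
        return out;
    }

    // Picks one position per term, tracking the smallest and largest position chosen.
    private static void collect(int[][] pos, int term, int min, int max,
                                int slop, List<int[]> out) {
        if (term == pos.length) {
            int start = min;
            int end = max + 1;
            // Each single-term span has length 1, so the total length is the term count.
            if ((end - start) - pos.length <= slop) {
                out.add(new int[] {start, end});
            }
            return;
        }
        for (int p : pos[term]) {
            collect(pos, term + 1, Math.min(min, p), Math.max(max, p), slop, out);
        }
    }

    public static void main(String[] args) {
        // Doc 11 contains t1 t2 t1 t3 t2 t3
        int[][] doc11 = {
            {0, 2},   // positions of t1
            {1, 4},   // positions of t2
            {3, 5},   // positions of t3
        };
        for (int[] m : matches(doc11, 1)) {
            System.out.println("start=" + m[0] + " end=" + m[1]);
        }
    }
}
```

Under those assumptions exhaustive enumeration yields four windows, including start=2 end=6, whereas advancing only the first subspans skips that last one.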

Regards,
Paul Elschot

On Thursday 04 March 2010 17:34:26, Goddard, Michael J. wrote:
 I've been working on some highlighting changes involving Spans 
 (https://issues.apache.org/jira/browse/LUCENE-2287) and could use some help 
 understanding when overlapping Spans are valid.  To illustrate, I added the 
 test below to the TestSpans class; this test fails because there is no fourth 
 range.
 
 Am I wrong in my expectation that that last range would match?
 
 Thanks.
 
   Mike
 
 
   // Doc 11 contains t1 t2 t1 t3 t2 t3
   public void testSpanNearUnOrderedOverlap() throws Exception {
     boolean ordered = false;
     int slop = 1;
     SpanNearQuery snq = new SpanNearQuery(
       new SpanQuery[] {
         makeSpanTermQuery("t1"),
         makeSpanTermQuery("t2"),
         makeSpanTermQuery("t3") },
       slop,
       ordered);
     Spans spans = snq.getSpans(searcher.getIndexReader());
 
     assertTrue("first range", spans.next());
     assertEquals("first doc", 11, spans.doc());
     assertEquals("first start", 0, spans.start());
     assertEquals("first end", 4, spans.end());
 
     assertTrue("second range", spans.next());
     assertEquals("second doc", 11, spans.doc());
     assertEquals("second start", 1, spans.start());
     assertEquals("second end", 4, spans.end());
 
     assertTrue("third range", spans.next());
     assertEquals("third doc", 11, spans.doc());
     assertEquals("third start", 2, spans.start());
     assertEquals("third end", 5, spans.end());
 
     // Question: why wouldn't this Span be found?
     assertTrue("fourth range", spans.next());
     assertEquals("fourth doc", 11, spans.doc());
     assertEquals("fourth start", 2, spans.start());
     assertEquals("fourth end", 6, spans.end());
 
     assertFalse("fifth range", spans.next());
   }
 
 


[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841545#action_12841545
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
I thought that when (3) happens, the delete-by-term needs to be issued against 
all DWs, so that later when they apply their deletes they'll remember to do so. 
Issuing that against all DWs will record the docID of each DW up until which 
the delete should apply.
{quote}

Yes, you still need to apply deletes on all DWs. My approach is not different 
in that regard.

{quote}
Also, I don't see the advantage of moving to store the deletes in int[] rather 
than bitset ... is it just to avoid calling the get(doc)?
{quote}

The big advantage is that all (re)opened readers can share the single int[] 
array. If you use a bitset you need to clone it for each reader. With the int[] 
reopening becomes basically free from a deletes perspective.




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841574#action_12841574
 ] 

Earwin Burrfoot commented on LUCENE-2294:
-

I voted for killing these delegating methods some time ago. It ended in 
nothing, so I vote again, #3 :)

 Create IndexWriterConfiguration and store all of IW configuration there
 ---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


 I would like to factor all IW configuration parameters out into a single 
 configuration class, which I propose to name IndexWriterConfiguration (or 
 IndexWriterConfig). I want to store there almost everything besides the 
 Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
 IndexWriterConfiguration). What I was thinking of storing there are the 
 following parameters:
 * All of the ctor parameters, except for Directory.
 * The different setters where it makes sense. For example I still think 
 infoStream should be set on IW directly.
 I'm thinking that IWC should expose everything via setter/getter methods, 
 and default to whatever IW defaults to today. Except for Analyzer, which will 
 need to be defined in the ctor of IWC and won't have a setter.
 I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
 a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 
 1 should be the default? Why not default to UNLIMITED and otherwise let 
 the application decide what LIMITED means for it? I would like to make MFL 
 optional on IWC and default to something, and I hope that default will be 
 UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
 the new API, he should be aware of that ...
 I plan to deprecate all the ctors and getters/setters and replace them by:
 * One ctor as described above
 * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
 for the setting of interest.
 * About the setters, I think maybe we can just introduce a setConfig method 
 which will override everything that is overridable today, except for 
 Analyzer. So someone could do iw.getConfig().setSomething(); 
 iw.setConfig(newConfig);
 ** The setters on IWC can return an IWC to allow chaining set calls ... so 
 the above will turn into 
 iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
 BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
 will greatly simplify IW's API.
 I'll start to work on a patch.
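
A minimal sketch of the shape being proposed, not the eventual patch. All names and defaults here are hypothetical (SketchWriterConfig, SketchWriter, the two example settings), the Directory argument is left out, and a String stands in for the Analyzer. It shows the analyzer fixed in the ctor with no setter, setters that return the config so calls can be chained, and getConfig/setConfig on the writer.

```java
/** Hypothetical config holder; names and defaults are illustrative, not Lucene's API. */
class SketchWriterConfig {
    private final String analyzerName;                 // set in the ctor only, no setter
    private int maxFieldLength = Integer.MAX_VALUE;    // stands in for UNLIMITED
    private double ramBufferSizeMB = 16.0;             // illustrative default

    SketchWriterConfig(String analyzerName) {
        this.analyzerName = analyzerName;
    }

    // Each setter returns this config so that set calls can be chained.
    SketchWriterConfig setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    SketchWriterConfig setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() { return analyzerName; }
    int getMaxFieldLength() { return maxFieldLength; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }
}

/** Hypothetical writer with a single ctor taking the config. */
class SketchWriter {
    private SketchWriterConfig config;

    SketchWriter(SketchWriterConfig config) {
        this.config = config;
    }

    SketchWriterConfig getConfig() { return config; }

    void setConfig(SketchWriterConfig config) { this.config = config; }
}

class ConfigChainingSketch {
    public static void main(String[] args) {
        SketchWriter iw = new SketchWriter(new SketchWriterConfig("standard"));
        // The chained form described in the issue text.
        iw.setConfig(iw.getConfig().setMaxFieldLength(10000).setRamBufferSizeMB(32.0));
        System.out.println(iw.getConfig().getMaxFieldLength());
        System.out.println(iw.getConfig().getRamBufferSizeMB());
    }
}
```

Note that in this sketch getConfig() hands back the live object, so the chained setters already mutate the writer's config before setConfig is called; whether the real class should return a copy instead is a design question the sketch does not settle.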




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841595#action_12841595
 ] 

Yonik Seeley commented on LUCENE-2294:
--

Yay, we'll be able to remove SolrIndexConfig and use this :-)




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841617#action_12841617
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. The big advantage is that all (re)opened readers can share the single int[] 
array.

Dirty reads will be a problem with sharing the array. An AtomicIntegerArray 
could be used. We need to experiment how expensive that would be. 
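
One way this could look, purely as a sketch: the comments above only say that readers share a single int[], so the generation-stamp scheme below is my assumption, and SharedDeletesSketch and its methods are hypothetical names. Each slot stores the generation at which the doc was deleted; a reader is just the generation it was opened at, so reopening copies nothing. Reads go through java.util.concurrent.atomic.AtomicIntegerArray to avoid dirty reads.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical shared deletes structure; the generation stamps are an assumption. */
class SharedDeletesSketch {
    private static final int LIVE = Integer.MAX_VALUE;  // marker for "never deleted"

    private final AtomicIntegerArray deletedAt;         // docID -> generation of its delete
    private int generation = 0;                         // guarded by "this"

    SharedDeletesSketch(int maxDoc) {
        deletedAt = new AtomicIntegerArray(maxDoc);
        for (int i = 0; i < maxDoc; i++) {
            deletedAt.set(i, LIVE);
        }
    }

    /** Marks a doc deleted as of a new generation; an earlier delete is kept as is. */
    synchronized void delete(int docID) {
        generation++;
        deletedAt.compareAndSet(docID, LIVE, generation);
    }

    /** "Opens" a reader: only the current generation is captured, nothing is cloned. */
    synchronized int openReader() {
        return generation;
    }

    /** A reader sees a delete only if it happened at or before the reader's generation. */
    boolean isDeleted(int readerGeneration, int docID) {
        return deletedAt.get(docID) <= readerGeneration;
    }
}

class SharedDeletesDemo {
    public static void main(String[] args) {
        SharedDeletesSketch deletes = new SharedDeletesSketch(4);
        int oldReader = deletes.openReader();
        deletes.delete(2);
        int newReader = deletes.openReader();
        System.out.println("old reader sees doc 2 deleted: " + deletes.isDeleted(oldReader, 2));
        System.out.println("new reader sees doc 2 deleted: " + deletes.isDeleted(newReader, 2));
    }
}
```

How much the volatile reads in isDeleted cost on the search path is exactly the thing that would need measuring.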




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841694#action_12841694
 ] 

Shai Erera commented on LUCENE-2294:


Ok, then I'll proceed w/ #3.
