[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857164#action_12857164
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
It's for performance. I expect there are apps where a given
thread/pool indexes certain kind of docs, ie, the app threads
themselves have "affinity" for docs with similar term distributions.
In which case, it's best (most RAM efficient) if those docs w/
presumably similar term stats are sent back to the same DW. If you
mix in different term stats into one buffer you get worse RAM
efficiency.
{quote}

I do see your point, but I feel like we shouldn't optimize/make compromises for 
this use case.  Mainly because I think apps with the kind of affinity you 
describe are very rare?  The usual design is a queued ingestion pipeline, where 
a pool of indexer threads takes docs out of a queue and feeds them to an 
IndexWriter, I think?  In such a world the threads wouldn't have an affinity 
for similar docs.
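
Just to illustrate what I mean by such a pipeline (a minimal sketch - the class 
and the thread-pool setup are made up for illustration, only 
IndexWriter.addDocument() is the actual Lucene API):

{code}
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch of a queued ingestion pipeline: a pool of indexer threads drains a
// shared queue and feeds docs to one IndexWriter, so no thread ends up with
// an affinity for docs with similar term distributions.
class IngestionPipelineSketch {
  static void start(final IndexWriter writer, final BlockingQueue<Document> queue,
                    int numIndexerThreads) {
    ExecutorService indexers = Executors.newFixedThreadPool(numIndexerThreads);
    for (int i = 0; i < numIndexerThreads; i++) {
      indexers.submit(new Runnable() {
        public void run() {
          try {
            while (!Thread.currentThread().isInterrupted()) {
              Document doc = queue.take();   // any idle thread takes the next doc
              writer.addDocument(doc);       // docs are spread arbitrarily over threads
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
    }
  }
}
{code}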

And if a user really has such different docs, maybe the right answer would be to 
have more than one index?  Even if today an app utilizes the thread 
affinity, this only results in maybe somewhat faster indexing performance, but 
the benefits would be lost after flushing/merging.  

If we assign docs randomly to available DocumentsWriterPerThreads, then we 
should on average make good use of the overall memory?  Alternatively we could 
also select the DWPT from the pool of available DWPTs that has the highest 
amount of free memory?  

Having fully decoupled memory management is compelling, I think, mainly 
because it makes everything so much simpler.  A DWPT could decide by itself when 
it's time to flush, and the other ones can keep going independently.  

If you do have a global RAM management, how would the flushing work?  E.g. when 
a global flush is triggered because all RAM is consumed, and we pick the DWPT 
with the highest amount of allocated memory for flushing, what will the other 
DWPTs do during that flush?  Wouldn't we have to pause the other DWPTs to make 
sure we don't exceed the maxRAMBufferSize?
Of course we could say "always flush when 90% of the overall memory is 
consumed", but how would we know that the remaining 10% won't fill up during 
the time the flush takes?  
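
To sketch what I mean by the decoupled approach (purely hypothetical code - 
DocumentsWriterPerThread doesn't exist yet, and bytesUsed()/flush() are just 
assumed names):

{code}
import java.io.IOException;
import java.util.List;

// Stub for the per-thread writer discussed above (hypothetical).
class DocumentsWriterPerThread {
  long bytesUsed() { return 0; }          // RAM held by this DWPT's private buffer
  void flush() throws IOException { }     // write this DWPT's private segment
}

class DWPTSelector {
  // Alternative to random assignment: pick the free DWPT with the least RAM
  // used, i.e. the most remaining buffer space.
  static DocumentsWriterPerThread pick(List<DocumentsWriterPerThread> free) {
    DocumentsWriterPerThread best = null;
    for (DocumentsWriterPerThread dwpt : free) {
      if (best == null || dwpt.bytesUsed() < best.bytesUsed()) {
        best = dwpt;
      }
    }
    return best;
  }

  // Fully decoupled memory management: each DWPT flushes itself once its own
  // private buffer is full; the other DWPTs keep indexing independently.
  static void maybeFlush(DocumentsWriterPerThread dwpt, long perThreadRAMLimit)
      throws IOException {
    if (dwpt.bytesUsed() >= perThreadRAMLimit) {
      dwpt.flush();
    }
  }
}
{code}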

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Assigned: (LUCENE-1698) Change backwards-compatibility policy

2010-04-14 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-1698:
-

Assignee: (was: Michael Busch)

:)

> Change backwards-compatibility policy
> -
>
> Key: LUCENE-1698
> URL: https://issues.apache.org/jira/browse/LUCENE-1698
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
>
> These proposed changes might still change slightly:
> I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a
> 'minor release' and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (we can later
> use different names; just for convenience here...)
> 1. The file format backwards-compatibility policy will remain unchanged;
>i.e. Lucene X.Y supports reading all indexes written with Lucene
>X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x
>indexes.
> 2. Deprecated public and protected APIs can be removed if they have
>been released in at least one major or minor release. E.g. a 3.1
>API can be released as deprecated in 3.2 and removed in 3.3 or 4.0
>(if 4.0 comes after 3.2).
> 3. No public or protected APIs are changed in a bugfix release; except
>if a severe bug can't be fixed otherwise.
> 4. Each release will have release notes with a new section
>"Incompatible changes", which lists, as the names says, all changes that
>break backwards compatibility. The list should also have information
>about how to convert to the new API. I think the eclipse releases
>have such a release notes section. Furthermore, the Deprecation tag 
>comment will state the minimum version when this API is to be removed,  
> e.g.
>@deprecated See #fooBar().  Will be removed in 3.3 
>or
>@deprecated See #fooBar().  Will be removed in 3.3 or later.
> I'd suggest treating a runtime change like an API change (unless it's fixing 
> a bug of course),
> i.e. giving a warning, providing a switch, switching the default behavior 
> only after a major 
> or minor release was around that had the warning/switch. 




[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-14 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: lucene-2324.patch

The patch removes all *PerThread classes downstream of DocumentsWriter.

This simplifies a lot of the flushing logic in the different consumers.  The 
patch also removes FreqProxMergeState, because we don't have to interleave 
posting lists from different threads anymore of course.  I really like these 
simplifications!

There is still a lot to do:  The changes in DocumentsWriter and IndexWriter are 
currently just experimental to make everything compile.  Next I will introduce 
DocumentsWriterPerThread and implement the sequenceID logic (which was 
discussed here in earlier comments) and the new RAM management.  I also want to 
go through the indexing chain once again - there are probably a few more things 
to clean up or simplify.

The patch compiles and actually a surprising number of tests pass.  Only the 
multi-threaded tests seem to fail, which is not very surprising, considering I 
removed all thread-handling logic from DocumentsWriter. :) 

So this patch isn't working yet - just wanted to post my current progress.  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-1879) Parallel incremental indexing

2010-04-09 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855377#action_12855377
 ] 

Michael Busch commented on LUCENE-1879:
---

{quote}
I'll start by describing the limitations of the current design (whether its the 
approach or the code is debatable):
{quote}

FWIW:  The attached code and approach were never meant to be committed.  I 
attached it for legal reasons, as it contains the IP that IBM donated to Apache 
via the software grant.  Apache requires attaching the code that is covered by 
such a grant.

I wouldn't want the master/slave approach in Lucene core.  You can implement it 
much more cleanly *inside* of Lucene.  The attached code, however, was developed with 
the requirement of having to run on top of an unmodified Lucene version.  

{quote}
I've realized this when I found that if tests (in this patch) are run with 
"-ea", there are many assert exceptions that are printed from 
IndexWriter.startCommit.
{quote}

The code runs without exceptions with Lucene 2.4.  It doesn't work with 
2.9/3.0, but you'll find an upgraded version that works with 3.0 within IBM, 
Shai.

> Parallel incremental indexing
> -
>
> Key: LUCENE-1879
> URL: https://issues.apache.org/jira/browse/LUCENE-1879
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
> Fix For: 3.1
>
> Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync 
> on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853751#action_12853751
 ] 

Michael Busch commented on LUCENE-2324:
---

Sorry, Jason, I got sidetracked with LUCENE-2329 and other things at work.  
I'll try to write the sequence ID stuff asap.  However, there's more we need to 
do here that is sort of independent of the deleted docs problem.  E.g. removing 
all the downstream perThread classes.   

We should work with the flex code from now on, as the flex branch will be 
merged into trunk soon.

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-04-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853509#action_12853509
 ] 

Michael Busch commented on LUCENE-2329:
---

We could move the if (postingsArray == null) check to start(); then we wouldn't 
have to check for every new term?
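
Something like this (a simplified sketch, not the actual 
TermsHashPerField/ParallelPostingsArray code - the names are only approximations):

{code}
// Move the lazy init of the parallel arrays into start(), so newTerm() no
// longer needs a null check for every new term.
class TermsHashPerFieldSketch {
  ParallelPostingsArraySketch postingsArray;

  void start() {
    if (postingsArray == null) {          // runs once per field, not once per term
      postingsArray = new ParallelPostingsArraySketch(2);
    }
  }

  void newTerm(int termID) {
    // no null check needed here anymore; start() guarantees allocation
    postingsArray.lastDocIDs[termID] = 0;
  }
}

class ParallelPostingsArraySketch {
  final int[] lastDocIDs;
  ParallelPostingsArraySketch(int size) { lastDocIDs = new int[size]; }
}
{code}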



> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329-2.patch, LUCENE-2329.patch, 
> LUCENE-2329.patch, LUCENE-2329.patch, lucene-2329.patch, lucene-2329.patch, 
> lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate jira issue, though.




[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-04-02 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852858#action_12852858
 ] 

Michael Busch commented on LUCENE-2329:
---

Thanks!  I think we can resolve this now?

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329-2.patch, LUCENE-2329.patch, 
> LUCENE-2329.patch, lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate jira issue, though.




[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-04-01 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852625#action_12852625
 ] 

Michael Busch commented on LUCENE-2329:
---

Looks great!  I like the removal of bytesAlloc - nice simplification.

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329-2.patch, LUCENE-2329.patch, 
> LUCENE-2329.patch, lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate jira issue, though.




[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-31 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2329:
--

Attachment: lucene-2329-2.patch

This patch:
 * Changes DocumentsWriter to trigger the flush using bytesAllocated instead of 
bytesUsed to improve the "running hot" issue Mike's seeing
 * Improves the ParallelPostingsArray to grow using ArrayUtil.oversize()

In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in 
freeRAM(), but that involves tricky thread-safety changes, because one thread 
could call DocumentsWriter.balanceRAM(), which triggers freeRAM() across *all* 
thread states, while other threads keep indexing.

We decided to leave it the way it currently works: we discard the whole 
parallel array during flush and don't reuse it.  This is not as efficient as it 
could be, but once LUCENE-2324 is done this won't be an issue anymore anyway.

Note that this new patch is against the flex branch: I thought we'd switch it 
over soon anyway?  I can also create a patch for trunk if that's preferred.
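
For reference, the growth pattern now looks roughly like this (simplified 
sketch; the field name is just for illustration, but ArrayUtil.oversize() is 
the real utility method):

{code}
import org.apache.lucene.util.ArrayUtil;

// Grow each parallel array to an oversized length so repeated term additions
// don't reallocate on every grow call.
class ParallelIntArraySketch {
  int[] lastDocIDs = new int[2];

  void grow(int minSize) {
    if (minSize > lastDocIDs.length) {
      // ArrayUtil.oversize picks a new length >= minSize with some headroom
      int newSize = ArrayUtil.oversize(minSize, 4 /* bytes per int */);
      int[] newArray = new int[newSize];
      System.arraycopy(lastDocIDs, 0, newArray, 0, lastDocIDs.length);
      lastDocIDs = newArray;
    }
  }
}
{code}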

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329-2.patch, lucene-2329.patch, 
> lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate jira issue, though.




[jira] Resolved: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2010-03-30 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-2126.
---

Resolution: Fixed

Committed revision 929340.

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch, lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.




[jira] Commented: (LUCENE-2111) Wrapup flexible indexing

2010-03-30 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851452#action_12851452
 ] 

Michael Busch commented on LUCENE-2111:
---

bq. Flex is generally faster.

Awesome work!  What changes make those queries run faster with the default 
codec?  Mostly terms dict changes and automaton for fuzzy/wildcard?

How's the indexing performance?


bq. I think net/net we are good to land flex!

+1!  Even if there are still small things to change/fix I think it makes sense 
to merge with trunk now.


> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: benchUtil.py, flex_backwards_merge_912395.patch, 
> flex_merge_916543.patch, flexBench.py, LUCENE-2111-EmptyTermsEnum.patch, 
> LUCENE-2111-EmptyTermsEnum.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_bytesRef.patch, 
> LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, 
> LUCENE-2111_mtqNull.patch, LUCENE-2111_mtqTest.patch, 
> LUCENE-2111_toString.patch
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice/versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self contained fixes.
> The end is in sight!




[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2010-03-30 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851451#action_12851451
 ] 

Michael Busch commented on LUCENE-2126:
---

I'll try to commit tonight to flex, but it'll probably be tomorrow (I think I 
have to update the patch, because there were some changes to IndexInput/Output).  
If you want to merge flex into trunk sooner I can also just commit this 
afterwards to trunk.

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch, lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-29 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851142#action_12851142
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
To clarify, the apply deletes doc id up to will be the flushed doc count saved 
per term/query per DW, though it won't be saved, it'll be derived from the 
sequence id int array where the action has been encoded into the seq id int?
{quote}

Yeah, that's the idea.  Let's see if it works :)

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-29 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851078#action_12851078
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
I'm not sure we need that level of complexity just yet? How
would we make the transaction log memory efficient?
{quote}

Is that really so complex?  You only need one additional int per doc in the 
DWPTs, and the global map for the delete terms.  You don't need to buffer the 
actual terms per DWPT.  I thought that's quite efficient?  But I'm totally open 
to other ideas.

I can try tonight to code a prototype of this - I don't think it would be very 
complex actually.  But of course there might be complications I haven't thought 
of.

bq.  Are there other uses you foresee?

Not really for the "transaction log" as you called it.  I'd remove that log 
once we switch to deletes in the FG (when the RAM buffer is searchable).  But a 
nice thing would be for add/update/delete to return the seqID, and also if the 
RAMReader in the future had an API to check up to which seqID it's able to 
"see".  Then it's very clear to a user of the API where a given reader is at.  
For this to work we have to assign the seqID at the *end* of a call.  E.g. when 
adding a large document, which takes a long time to process, it should get the 
seqID assigned after the "work" is done and right before the addDocument() call 
returns.  



> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-29 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850989#action_12850989
 ] 

Michael Busch commented on LUCENE-2329:
---

Good catch!

Thanks for the thorough explanation and suggestions.  I think it all makes 
sense.  Will work on a patch.

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate jira issue, though.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-28 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850792#action_12850792
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
However, in the apply deletes
method how would we know which doc to stop deleting at? How
would the seq id map to a DW's doc id?
{quote}

We could have a global deletes-map that stores seqID -> DeleteAction.  
DeleteAction either contains a Term or a Query, and in addition an int 
"flushCount" (I'll explain in a bit what flushCount is used for.)

Each DocumentsWriterPerThread would have a growing array that contains each 
seqID that "affected" that DWPT, i.e. the seqIDs of *all* deletes, plus the 
seqIDs of the adds/updates performed by that particular DWPT.  One bit of a 
seqID in that array can indicate if it's a delete or add/update.

When it's time to flush we sort the array by increasing seqID and then loop a 
single time through it to find the seqIDs of all DeleteActions.  During the 
loop we count the number of adds/updates to determine the number of docs the 
DeleteActions affect.  After applying the deletes the DWPT makes a synchronized 
call to the global deletes-map and increments the flushCount int for each 
applied DeleteAction.  If flushCount==numThreadStates (== number of DWPT 
instances) the corresponding DeleteAction entry can be removed, because it was 
applied to all DWPT.

I think this should work?  Or is there a simpler solution?
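
A rough sketch of the scheme (all class and method names here are made up, 
just to illustrate the idea):

{code}
import java.util.Arrays;
import java.util.SortedMap;
import java.util.TreeMap;
import org.apache.lucene.index.Term;

// Global map entry: a delete Term (or Query) plus a flushCount.
class DeleteAction {
  final Term term;          // could also hold a Query instead
  int flushCount;           // number of DWPTs that have applied this delete
  DeleteAction(Term term) { this.term = term; }
}

class GlobalDeletes {
  final SortedMap<Long, DeleteAction> bySeqID = new TreeMap<Long, DeleteAction>();

  // Called by a DWPT after it applied the delete during its flush; once all
  // numThreadStates DWPTs have applied it, the entry can be removed.
  synchronized void markApplied(long seqID, int numThreadStates) {
    DeleteAction action = bySeqID.get(seqID);
    if (action != null && ++action.flushCount == numThreadStates) {
      bySeqID.remove(seqID);
    }
  }
}

// Per-DWPT growing log of seqIDs; the lowest bit marks delete vs. add/update.
class SeqIDLog {
  long[] seqIDs = new long[16];
  int count;

  void add(long seqID, boolean isDelete) {
    if (count == seqIDs.length) {
      seqIDs = Arrays.copyOf(seqIDs, seqIDs.length * 2);
    }
    seqIDs[count++] = (seqID << 1) | (isDelete ? 1L : 0L);
  }

  // At flush time: sort by seqID and loop once, counting adds/updates to know
  // how many buffered docs each DeleteAction affects in this DWPT.
  void applyOnFlush(GlobalDeletes deletes, int numThreadStates) {
    Arrays.sort(seqIDs, 0, count);
    int docsSoFar = 0;
    for (int i = 0; i < count; i++) {
      if ((seqIDs[i] & 1L) != 0) {                 // a delete
        long seqID = seqIDs[i] >>> 1;
        // ... apply the DeleteAction to the first docsSoFar buffered docs ...
        deletes.markApplied(seqID, numThreadStates);
      } else {
        docsSoFar++;                               // an add/update of this DWPT
      }
    }
  }
}
{code}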


> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-28 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850766#action_12850766
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. I think for this same reason the ThreadBinder should have affinity

Mike, can you explain what the advantages of this kind of thread affinity are?  
I was always wondering why the DocumentsWriter code currently makes efforts to 
assign a ThreadState always to the same Thread?  Is that being done for 
performance reasons?  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-28 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850760#action_12850760
 ] 

Michael Busch commented on LUCENE-2324:
---

Yes, we would need to buffer terms/queries per DW and also per DW the 
BufferedDeletes.Num.  The docID spaces in two DWs will be completely 
independent of each other after this change.


One potential problem that we (I think) have even today is the following: If 
you index with multiple threads, and then call e.g. deleteDocuments(Term) with 
one of the indexer threads while you keep adding documents with the other 
threads, it's not clear to the caller when exactly the deleteDocuments(Term) 
will happen.  It depends on the thread scheduling. 

Going back to the idea I mentioned here:
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841407&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841407

I mentioned the idea of having a sequence ID that gets incremented on add, 
delete, update.  What if we had a global sequence ID even with separate DWs?  
The sequence ID would tell you unambiguously which action happened when.  The 
add/update/delete methods could return the sequenceID that was assigned to that 
particular action.  

Then we could e.g. track the delete terms globally together with the sequenceID 
of the corresponding delete call, while we still apply deletes during flush.  
Since sequenceIDs enforce a strict ordering we can figure out to how many docs 
per DW we need to apply the delete terms.

Later when we switch to real-time deletes (when the RAM is searchable) we will 
simply store the sequenceIDs in the deletes int[] array which I mentioned in my 
comment on LUCENE-2293.

Does this make sense?
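
Roughly like this (sketch only, hypothetical API):

{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;

// One global counter shared by all DWPTs; add/update/delete return the seqID
// they were assigned, so the ordering of actions across threads is unambiguous.
class IndexWriterSketch {
  private final AtomicLong nextSeqID = new AtomicLong();

  long addDocument(Document doc) {
    // ... hand the doc to a DWPT and do the actual indexing work first ...
    // the seqID is assigned at the *end* of the call, right before returning,
    // so even a large doc that takes long to process is ordered by completion.
    return nextSeqID.incrementAndGet();
  }

  long deleteDocuments(Term term) {
    long seqID = nextSeqID.incrementAndGet();
    // ... track the delete term globally together with this seqID, and apply
    // it in each DWPT during its flush ...
    return seqID;
  }
}
{code}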

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850312#action_12850312
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. Not all apps index only 140 character docs from all threads 

What a luxury! :)

{quote}
I think for this same reason the ThreadBinder should have affinity, ie, try to 
schedule the same thread to the same DW, assuming it's free. If it's not free 
and another DW is free you should use the other one.
{quote}

If you didn't have such an affinity but use a random assignment of DWs to 
threads, would that balance the RAM usage across DWs without a global RAM 
management?

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-1879) Parallel incremental indexing

2010-03-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850268#action_12850268
 ] 

Michael Busch commented on LUCENE-1879:
---

LUCENE-2324 will be helpful to support multi-threaded parallel-indexing.  If we 
have single-threaded DocumentsWriters, then it should be easy to have a 
ParallelDocumentsWriter? 

> Parallel incremental indexing
> -
>
> Key: LUCENE-1879
> URL: https://issues.apache.org/jira/browse/LUCENE-1879
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
> Fix For: 3.1
>
> Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync 
> on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850265#action_12850265
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
I'm not sure how we'd enforce the number of threads? Or we'd
have to re-implement the wait system implemented in DW? 
{quote}

I was thinking we were going to do that... having a fixed number of 
DocumentsWriterPerThread instances, and a ThreadBinder that lets a thread wait 
if the per-thread instance is not available.  You don't need to interleave docIds then?  
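
Something like this (sketch; DocumentsWriterPerThread here is just a stub):

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.lucene.document.Document;

// Stub for the per-thread writer; each instance buffers its own private
// segment and assigns its own docIDs, so nothing needs to be interleaved.
class DocumentsWriterPerThread {
  void addDocument(Document doc) { /* index into this DWPT's private buffer */ }
}

// A fixed pool of DWPTs; an indexing thread blocks until one is free,
// uses it, and puts it back.
class ThreadBinderSketch {
  private final BlockingQueue<DocumentsWriterPerThread> pool;

  ThreadBinderSketch(int numThreadStates) {
    pool = new ArrayBlockingQueue<DocumentsWriterPerThread>(numThreadStates);
    for (int i = 0; i < numThreadStates; i++) {
      pool.add(new DocumentsWriterPerThread());
    }
  }

  void index(Document doc) throws InterruptedException {
    DocumentsWriterPerThread dwpt = pool.take();   // wait if all DWPTs are busy
    try {
      dwpt.addDocument(doc);
    } finally {
      pool.put(dwpt);                              // make the DWPT available again
    }
  }
}
{code}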


> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850262#action_12850262
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
But if 1 thread tends to index lots of biggish docs... don't we want to allow 
it to use up more than 1/nth?
Ie we don't want to flush unless total RAM usage has hit the limit?
{quote}

Sure, that'd be the disadvantage.  But is that a realistic scenario?  That the 
"avg. document size per thread" differs significantly across threads in an application?  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850235#action_12850235
 ] 

Michael Busch commented on LUCENE-2324:
---

The easiest would be if each DocumentsWriterPerThread had a fixed buffer size; 
then they could flush fully independently and you wouldn't need to manage RAM 
globally across threads.

Of course then you'd need two config parameters: number of concurrent threads 
and buffer size per thread.
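
E.g. (hypothetical config object, just to illustrate the two knobs):

{code}
// With private per-thread buffers the overall RAM bound is simply the
// product of the two parameters.
class PerThreadIndexingConfigSketch {
  final int maxThreadStates;               // number of DWPT instances
  final double ramBufferSizeMBPerThread;   // private buffer size per DWPT

  PerThreadIndexingConfigSketch(int maxThreadStates, double ramBufferSizeMBPerThread) {
    this.maxThreadStates = maxThreadStates;
    this.ramBufferSizeMBPerThread = ramBufferSizeMBPerThread;
  }

  double totalRAMBufferSizeMB() {
    return maxThreadStates * ramBufferSizeMBPerThread;
  }
}
{code}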


> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Created: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search

2010-03-25 Thread Michael Busch (JIRA)
Explore other in-memory postinglist formats for realtime search
---

 Key: LUCENE-2346
 URL: https://issues.apache.org/jira/browse/LUCENE-2346
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


The current in-memory posting list format might not be optimal for searching. 
VInt decoding performance and the lack of skip lists would arguably be the 
biggest bottlenecks.

For LUCENE-2312 we should investigate other formats.

Some ideas:
- PFOR or packed ints for posting slices?
- Maybe even int[] slices instead of byte slices? This would be great for 
search performance, but the additional memory overhead might not be acceptable.
- For realtime search it's usually desirable to evaluate the most recent 
documents first.  So using backward pointers instead of forward pointers and 
having the postinglist pointer point to the most recent docID in a list is 
something to consider.
- Skipping: if we use fixed-length postings ([packed] ints) we can do binary 
search within a slice.  We can also locate a pointer then without scanning and 
thus skip entire slices quickly.  Is that sufficient or would we need more 
skipping layers, so that it's possible to skip directly to particular slices?
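
A sketch of the skipping idea (assuming fixed-length int slices of ascending 
docIDs; names are made up):

{code}
import java.util.Arrays;

// If a posting slice is a fixed-length int[] of ascending docIDs, we can
// binary-search inside the slice instead of scanning, and skip a whole slice
// if its last docID is still smaller than the target.
class IntSliceSkipperSketch {
  // Returns the index of the first docID >= target in slice[0..count), or
  // count if the entire slice can be skipped.
  static int advance(int[] slice, int count, int target) {
    if (count == 0 || slice[count - 1] < target) {
      return count;
    }
    int pos = Arrays.binarySearch(slice, 0, count, target);
    return pos >= 0 ? pos : -pos - 1;   // insertion point if target not present
  }
}
{code}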


It would be awesome to find a format that doesn't slow down "normal" indexing, 
but is very efficient for in-memory searches.  If we can't find such a 
one-size-fits-all format, we should have a separate indexing chain for real-time indexing.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849899#action_12849899
 ] 

Michael Busch commented on LUCENE-2324:
---

Awesome!

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849819#action_12849819
 ] 

Michael Busch commented on LUCENE-2324:
---

Hey Jason,

Disregard my patch here.  I just experimented with removal of pooling, but then 
did LUCENE-2329 instead.  TermsHash and TermsHashPerThread are now much 
simpler, because all the pooling code is gone after 2329 was committed.  Should 
make it a little easier to get this patch done.

Sure, it'd be awesome if you could provide a patch here.  I can help you; we 
should just post patches here frequently so that we don't both work on the same 
areas.



> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: (was: lucene-2324-no-pooling.patch)

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-23 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-2329.
---

Resolution: Fixed

Committed revision 926791.

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848855#action_12848855
 ] 

Michael Busch commented on LUCENE-2329:
---

Cool, will do!  Thanks for the review and good questions... and the whole idea! 
:)

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848827#action_12848827
 ] 

Michael Busch edited comment on LUCENE-2329 at 3/23/10 6:06 PM:


{quote}
They save the object header per-unique-term, and 4 bytes on 64bit JREs since 
the "pointer" is now an int and not a real pointer?
{quote}

On 64-bit JVMs (which I used for my tests) we actually save 28 bytes per 
unique term:

h4. Trunk:
{code}
// Why + 4*POINTER_NUM_BYTE below?
//   +1: Posting is referenced by postingsFreeList array
//   +3: Posting is referenced by hash, which
//   targets 25-50% fill factor; approximate this
//   as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 
4*DocumentsWriter.POINTER_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...
abstract class RawPostingList {
  final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES + 
3*DocumentsWriter.INT_NUM_BYTE;

...

  // Coarse estimates used to measure RAM usage of buffered deletes
  final static int OBJECT_HEADER_BYTES = 8;
  final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
{code}

This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes. 

h4. 2329:
{code}
//   +3: Posting is referenced by hash, which
//   targets 25-50% fill factor; approximate this
//   as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 
3*DocumentsWriter.INT_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
return ParallelPostingsArray.BYTES_PER_POSTING + 4 * 
DocumentsWriter.INT_NUM_BYTE;
  }

...

final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
{code}

This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.


I checked how many bytes were allocated for postings when the first segment was 
flushed.  Trunk flushed after 6400 docs and had 103MB allocated for PostingList 
objects.  2329 flushed after 8279 docs and had 94MB allocated for the parallel 
arrays, and 74MB out of the 94MB were actually used.

The first docs in the wikipedia dataset seem pretty large with many unique 
terms.

I think this sounds reasonable?

  was (Author: michaelbusch):
{quote}
They save the object header per-unique-term, and 4 bytes on 64bit JREs since 
the "pointer" is now an int and not a real pointer?
{quote}

On 64-bit JVMs (which I used for my tests) we actually save 28 bytes per posting:

h4. Trunk:
{code}
// Why + 4*POINTER_NUM_BYTE below?
//   +1: Posting is referenced by postingsFreeList array
//   +3: Posting is referenced by hash, which
//   targets 25-50% fill factor; approximate this
//   as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 
4*DocumentsWriter.POINTER_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...
abstract class RawPostingList {
  final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES + 
3*DocumentsWriter.INT_NUM_BYTE;

...

  // Coarse estimates used to measure RAM usage of buffered deletes
  final static int OBJECT_HEADER_BYTES = 8;
  final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
{code}

This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes. 

h4. 2329:
{code}
//   +3: Posting is referenced by hash, which
//   targets 25-50% fill factor; approximate this
//   as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 
3*DocumentsWriter.INT_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
return ParallelPostingsArray.BYTES_PER_POSTING + 4 * 
DocumentsWriter.INT_NUM_BYTE;
  }

...

final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
{code}

This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.


I checked how many bytes were allocated for postings when the first segment was 
flushed.  Trunk flushed after 6400 docs and had 103MB allocated for PostingList 
objects.  2329 flushed after 8279 docs and had 94MB allocated for the parallel 
arrays, and 74MB out of the 94MB were actually used.

The first docs in the wikipedia dataset seem pretty large with many unique 
terms.

I think this sounds reasonable?
  
> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was 

[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848827#action_12848827
 ] 

Michael Busch commented on LUCENE-2329:
---

{quote}
They save the object header per-unique-term, and 4 bytes on 64bit JREs since 
the "pointer" is now an int and not a real pointer?
{quote}

On 64-bit JVMs (which I used for my tests) we actually save 28 bytes per posting:

h4. Trunk:
{code}
// Why + 4*POINTER_NUM_BYTE below?
//   +1: Posting is referenced by postingsFreeList array
//   +3: Posting is referenced by hash, which
//   targets 25-50% fill factor; approximate this
//   as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 
4*DocumentsWriter.POINTER_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
  }

...
abstract class RawPostingList {
  final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES + 
3*DocumentsWriter.INT_NUM_BYTE;

...

  // Coarse estimates used to measure RAM usage of buffered deletes
  final static int OBJECT_HEADER_BYTES = 8;
  final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
{code}

This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes. 

h4. 2329:
{code}
//   +3: Posting is referenced by hash, which
//   targets 25-50% fill factor; approximate this
//   as 3X # pointers
bytesPerPosting = consumer.bytesPerPosting() + 
3*DocumentsWriter.INT_NUM_BYTE;

...

  @Override
  int bytesPerPosting() {
return ParallelPostingsArray.BYTES_PER_POSTING + 4 * 
DocumentsWriter.INT_NUM_BYTE;
  }

...

final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
{code}

This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.


I checked how many bytes were allocated for postings when the first segment was 
flushed.  Trunk flushed after 6400 docs and had 103MB allocated for PostingList 
objects.  2329 flushed after 8279 docs and had 94MB allocated for the parallel 
arrays, and 74MB out of the 94MB were actually used.

The first docs in the wikipedia dataset seem pretty large with many unique 
terms.

I think this sounds reasonable?

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848782#action_12848782
 ] 

Michael Busch commented on LUCENE-2329:
---

{quote}
OK, but, RAM used by TermVectors* shouldn't participate in the accounting... ie 
it only holds RAM for the one doc, at a time.
{quote}

Man, my brain is lacking the TermVector synapses...

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848748#action_12848748
 ] 

Michael Busch commented on LUCENE-2329:
---

{quote}
so it's surprising the savings was so much that you get 22% fewer segments... 
are you sure there isn't a bug in the RAM usage accounting?
{quote}

Yeah it seems a bit suspicious.  I'll investigate.  But, keep in mind that 
TermVectors were enabled too.  And the number of "unique terms" in the 2nd 
TermsHash is higher, i.e. if you summed up numPostings from the 2nd TermsHash 
in each round that sum should be higher than numPostings from the first 
TermsHash. 

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848475#action_12848475
 ] 

Michael Busch edited comment on LUCENE-2329 at 3/23/10 12:51 AM:
-

I did some performance experiments:

I indexed 1M wikipedia documents using the cheap WhiteSpaceAnalyzer, no cfs 
files, disabled any merging,  RAM buffer size = 200MB, single writer thread, 
TermVectors enabled.  

Test machine: MacBook Pro, 2.53 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3, MacOS 
X 10.5.8.

h4. Results with -Xmx2000m:

|| || Write performance || Gain || Number of segments ||
| trunk | 833 docs/sec |  |  41 |
| trunk + parallel arrays | 869 docs/sec | {color:green} + 4.3% {color} | 32|


h4. Results with -Xmx256m:

|| || Write performance || Gain || Number of segments ||
| trunk | 467 docs/sec |  | 41 |  
| trunk + parallel arrays | 871 docs/sec | {color:green} +86.5% {color} | 32|

So I think these results are interesting and roughly as expected.  4.3% is a 
nice small performance gain.
But running the tests with a low heap shows how much cheaper the garbage 
collection becomes.  Setting IW's RAM buffer to 200MB and the overall heap to 
256MB forces the gc to run frequently.  The mark times are much more costly if 
we have all long-living PostingList objects in memory compared to parallel 
arrays.

So this is probably not a huge deal for "normal" indexing.  But once we can 
search on the RAM buffer it becomes much more attractive to fill up the RAM as 
much as you can.  And exactly in that case we save a lot with this improvement.

Also note that the number of segments decreased by 22% (from 41 to 32).  This 
shows that the parallel-array approach needs less RAM, thus flushes less often 
and will cause fewer segment merges in the long run.  So a longer test with 
actual segment merges would show even bigger gains (with both big and small 
heaps).

So overall, I'm very happy with these results!


  was (Author: michaelbusch):
I did some performance experiments:

I indexed 1M wikipedia documents using the cheap WhiteSpaceAnalyzer, no cfs 
files, disabled any merging,  RAM buffer size = 200MB, single writer thread, 
TermVectors enabled.

h4. Results with -Xmx2000m:

|| || Write performance || Gain || Number of segments ||
| trunk | 833 docs/sec |  |  41 |
| trunk + parallel arrays | 869 docs/sec | {color:green} + 4.3% {color} | 32|


h4. Results with -Xmx256m:

|| || Write performance || Gain || Number of segments ||
| trunk | 467 docs/sec |  | 41 |  
| trunk + parallel arrays | 871 docs/sec | {color:green} +86.5% {color} | 32|

So I think these results are interesting and roughly as expected.  4.3% is a 
nice small performance gain.
But running the tests with a low heap shows how much cheaper the garbage 
collection becomes.  Setting IW's RAM buffer to 200MB and the overall heap to 
256MB forces the gc to run frequently.  The mark times are much more costly if 
we have all long-living PostingList objects in memory compared to parallel 
arrays.

So this is probably not a huge deal for "normal" indexing.  But once we can 
search on the RAM buffer it becomes much more attractive to fill up the RAM as 
much as you can.  And exactly in that case we save a lot with this improvement.

Also note that the number of segments decreased by 22% (from 41 to 32).  This 
shows that the parallel-array approach needs less RAM, thus flushes less often 
and will cause fewer segment merges in the long run.  So a longer test with 
actual segment merges would show even bigger gains (with both big and small 
heaps).

So overall, I'm very happy with these results!

  
> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a 

[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848475#action_12848475
 ] 

Michael Busch commented on LUCENE-2329:
---

I did some performance experiments:

I indexed 1M wikipedia documents using the cheap WhiteSpaceAnalyzer, no cfs 
files, disabled any merging,  RAM buffer size = 200MB, single writer thread, 
TermVectors enabled.

h4. Results with -Xmx2000m:

|| || Write performance || Gain || Number of segments ||
| trunk | 833 docs/sec |  |  41 |
| trunk + parallel arrays | 869 docs/sec | {color:green} + 4.3% {color} | 32|


h4. Results with -Xmx256m:

|| || Write performance || Gain || Number of segments ||
| trunk | 467 docs/sec |  | 41 |  
| trunk + parallel arrays | 871 docs/sec | {color:green} +86.5% {color} | 32|

So I think these results are interesting and roughly as expected.  4.3% is a 
nice small performance gain.
But running the tests with a low heap shows how much cheaper the garbage 
collection becomes.  Setting IW's RAM buffer to 200MB and the overall heap to 
256MB forces the gc to run frequently.  The mark times are much more costly if 
we have all long-living PostingList objects in memory compared to parallel 
arrays.

So this is probably not a huge deal for "normal" indexing.  But once we can 
search on the RAM buffer it becomes much more attractive to fill up the RAM as 
much as you can.  And exactly in that case we save a lot with this improvement.

Also note that the number of segments decreased by 22% (from 41 to 32).  This 
shows that the parallel-array approach needs less RAM, thus flushes less often 
and will cause fewer segment merges in the long run.  So a longer test with 
actual segment merges would show even bigger gains (with both big and small 
heaps).

So overall, I'm very happy with these results!


> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848226#action_12848226
 ] 

Michael Busch commented on LUCENE-2312:
---

I think sync'ing after every doc is probably the better option.  We'll still 
avoid the need to make all variables downstream of DocumentsWriter 
volatile/atomic, which should be a nice performance gain.

The problem with the delayed sync'ing (after e.g. 100 docs) is that if you 
don't have a never-ending stream of twee... err documents, then you might want 
to force an explicit sync at some point.  But that's very hard, because you 
would have to force the writer thread to make e.g. a volatile write via an API 
call.  And if that's an IndexWriter API that has to trigger the sync on 
multiple DocumentsWriter instances (i.e. multiple writer threads), I don't see 
how that's possible unless Lucene manages its own pool of threads.
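
A minimal sketch of the per-document option (hypothetical names; it assumes a 
single writer thread per DocumentsWriter): one volatile field that the writer 
updates after each document, so none of the downstream fields need to be 
volatile themselves:

{code}
class DocPublisher {
  // The only volatile field: the highest docID whose postings are fully buffered.
  private volatile int maxDocID = -1;

  // Writer thread, once per document, after all postings for docID are written.
  // The volatile write makes everything written before it visible to readers.
  void publish(int docID) {
    maxDocID = docID;
  }

  // Reader threads: a single volatile read gives a safe upper bound (maxDoc).
  int maxSearchableDocID() {
    return maxDocID;
  }
}
{code}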

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848210#action_12848210
 ] 

Michael Busch edited comment on LUCENE-2312 at 3/22/10 5:01 PM:


bq. So.. what does this mean for allowing an IR impl to directly search IW's 
RAM buffer?

The main idea is to have an approach that is lock-free.  Then write performance 
will not suffer no matter how big your query load is.

When you open/reopen a RAMReader it would first ask the MemoryBarrier for the 
last sync'ed docID (volatile read).  This would be the maxDoc for that reader 
and it's safe for the reader to read up to that id, because it can be sure that 
all changes the writer thread made up to that maxDoc are visible to the reader.

If we called MemoryBarrier.sync() let's say every 100 docs, then the max. 
search latency would be the amount of time it takes to index 100 docs.  Doing 
no volatile/atomic writes and not going through explicit locks for 100 docs 
will allow the JVM to do all its nice optimizations.  I think this will work, 
but honestly I don't really have a good feel for how much performance this 
approach would gain compared to writing to volatile variables for every 
document.

  was (Author: michaelbusch):
bq. So.. what does this mean for allowing an IR impl to directly search 
IW's RAM buffer?

The main idea is to have an approach that is lock-free.  Then write performance 
will not suffer no matter how big your query load is.

When you open/reopen a RAMReader it would first ask the MemoryBarrier for the 
last sync'ed docID.  This would be the maxDoc for that reader and it's safe for 
the reader to read up to that id, because it can be sure that all changes the 
writer thread made up to that maxDoc are visible to the reader.

If we called MemoryBarrier.sync() let's say every 100 docs, then the max. 
search latency would be the amount of time it takes to index 100 docs.  Doing 
no volatile/atomic writes and not going through explicit locks for 100 docs 
will allow the JVM to do all its nice optimizations.  I think this will work, 
but honestly I don't really have a good feel for how much performance this 
approach would gain compared to writing to volatile variables for every 
document.
  
> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848210#action_12848210
 ] 

Michael Busch commented on LUCENE-2312:
---

bq. So.. what does this mean for allowing an IR impl to directly search IW's 
RAM buffer?

The main idea is to have an approach that is lock-free.  Then write performance 
will not suffer no matter how big your query load is.

When you open/reopen a RAMReader it would first ask the MemoryBarrier for the 
last sync'ed docID.  This would be the maxDoc for that reader and it's safe for 
the reader to read up to that id, because it can be sure that all changes the 
writer thread made up to that maxDoc are visible to the reader.

If we called MemoryBarrier.sync() let's say every 100 docs, then the max. 
search latency would be the amount of time it takes to index 100 docs.  Doing 
no volatile/atomic writes and not going through explicit locks for 100 docs 
will allow the JVM to do all its nice optimizations.  I think this will work, 
but honestly I don't really have a good feel for how much performance this 
approach would gain compared to writing to volatile variables for every 
document.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848198#action_12848198
 ] 

Michael Busch commented on LUCENE-2312:
---

Hi Brian - good to see you on this list!

In my previous comment I actually quoted some sections of the concurrency book:
https://issues.apache.org/jira/browse/LUCENE-2312?focusedCommentId=12845712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845712

Did I understand correctly that a volatile write can be used to enforce a 
cache->RAM write-through of *all* updates a thread made that came before the 
volatile write in the thread's program order?

My idea here was to use this behavior to avoid volatile writes for every 
document, but instead to periodically do such a volatile write (say e.g. every 
100 documents).  I implemented a class called MemoryBarrier, which keeps track 
of when the last volatile write happened.  A reader thread can ask the 
MemoryBarrier what the last successfully processed docID before crossing the 
barrier was.  The reader will then never attempt to read beyond that document.

Of course there are tons of details regarding safe publication of all involved 
fields and objects.  I was just wondering if this general "memory barrier" 
approach seems right and if indeed performance gains can be expected compared 
to doing volatile writes for every document?
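
A minimal sketch of the approach described above (method names are made up; 
this is not the actual MemoryBarrier code):

{code}
class MemoryBarrier {
  private static final int SYNC_INTERVAL = 100;

  // The only volatile field.  The volatile write below publishes *all* updates
  // the writer thread made before it (in program order) to any thread that
  // later reads this field.
  private volatile int lastSyncedDocID = -1;

  // Writer thread: called after each document, but only does a volatile write
  // every SYNC_INTERVAL documents.
  void maybeSync(int docID) {
    if (docID % SYNC_INTERVAL == 0) {
      lastSyncedDocID = docID;
    }
  }

  // Reader threads: the last docID that is guaranteed to be fully visible.
  // A (re)opened reader would use this as its maxDoc and never read beyond it.
  int lastSyncedDocID() {
    return lastSyncedDocID;
  }
}
{code}

The maximum search latency is then bounded by the time it takes to index 
SYNC_INTERVAL documents, as described above.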

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2329:
--

Attachment: lucene-2329.patch

Removed reset().  All tests still pass.

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848161#action_12848161
 ] 

Michael Busch commented on LUCENE-2329:
---

bq. I think *ParallelPostingsArray.reset do not need to zero-fill the arrays - 
these are overwritten when that termID is first used, right?

Good point!  I'll remove the reset() methods.

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2329:
--

Attachment: lucene-2329.patch

Made the memory tracking changes as described in my previous comment.

All tests still pass.

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848058#action_12848058
 ] 

Michael Busch commented on LUCENE-2329:
---

One change I should make to the patch is how the memory consumption is tracked.  
Should bytesAllocated() be called when the parallel array is allocated or grown?  
And should bytesUsed() only be called when a new termID is added?

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-22 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2329:
--

Attachment: lucene-2329.patch

Changes the indexer to use parallel arrays instead of PostingList objects (for 
both FreqProx* and TermVectors*).

All core & contrib & bw tests pass.  I haven't done performance tests yet.  

I'm wondering how to manage the size of the parallel array.  I started with an 
initial size for the parallel array equal to the size of the postingsHash 
array.  When it's full, I allocate a new one at 1.5x the size.  When 
shrinkHash() is called I also shrink the parallel array to the same size as 
postingsHash.  How does that sound?
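
A minimal sketch of that sizing policy (helper names are made up, not from the 
patch):

{code}
import java.util.Arrays;

class ParallelArraySizing {
  // Grow to at least minSize, using a 1.5x growth factor.
  static int[] grow(int[] array, int minSize) {
    if (minSize <= array.length) {
      return array;
    }
    int newSize = Math.max(minSize, (int) (array.length * 1.5));
    return Arrays.copyOf(array, newSize);
  }

  // Shrink back to newSize (e.g. the postingsHash size after shrinkHash()).
  static int[] shrink(int[] array, int newSize) {
    return newSize < array.length ? Arrays.copyOf(array, newSize) : array;
  }
}
{code}

Each of the parallel arrays would be grown and shrunk together, so a termID 
stays a valid index into all of them.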

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847068#action_12847068
 ] 

Michael Busch commented on LUCENE-2329:
---

bq. Hmm the challenge is that the tracking done for term vectors is just within 
a single doc.

Duh! Of course you're right.


> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-lived PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling and will remove the overhead of object 
> initialization and garbage collection.  Garbage collection especially should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a large number of 
> long-lived objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though we can make this improvement in a separate JIRA issue.




[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-18 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847024#action_12847024
 ] 

Michael Busch commented on LUCENE-2329:
---

bq. This issue is just about how IndexWriter's RAM buffer stores its terms... 

Actually, when I talked about the TermVectors I meant we should explore 
storing the termIDs on *disk*, rather than the strings.  It would help things 
like similarity search and facet counting.

{quote}
But, note that term vectors today do not store the term char[] again - they 
piggyback on the term char[] already stored for the postings.
{quote}

Yeah I think I'm familiar with that part (secondary entry point in 
TermsHashPerField, hashes based on termStart).  I haven't looked much into how 
the "rest" of the TermVector in-memory data structures work.  

{quote}
Though, I believe they store "int textStart" (increments by term length per 
unique term), which is less compact than the termID would be (increments +1 per 
unique term)
{quote}

Actually we wouldn't need a second hash table for the secondary TermsHash 
anymore, right?  Like the primary TermsHash, it would just have a parallel 
array with the fields that the TermVectorsTermsWriter.PostingList class 
currently contains (freq, lastOffset, lastPosition), and the index into that 
array would of course be the termID.

This would be a nice simplification, because no hash collisions, no hash table 
resizing based on load factor, etc. would be necessary for the non-primary 
TermsHashes?
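
As a rough illustration (all names invented, not from the actual patch), the 
non-primary TermsHash would then keep its per-term data in arrays addressed 
directly by the termID the primary TermsHash assigned:

{code}
// Sketch: per-term data of the secondary (term vector) TermsHash, indexed
// by the termID from the primary hash - no second hash table, no collision
// handling, no load-factor-based resizing.
class SecondaryTermsHashSketch {
  int[] freqs = new int[16];
  int[] lastOffsets = new int[16];
  int[] lastPositions = new int[16];

  void addOccurrence(int termID, int offset, int position) {
    if (termID >= freqs.length) {       // grow all parallel arrays together
      int newSize = Math.max(termID + 1, (int) (freqs.length * 1.5));
      freqs = grow(freqs, newSize);
      lastOffsets = grow(lastOffsets, newSize);
      lastPositions = grow(lastPositions, newSize);
    }
    freqs[termID]++;
    lastOffsets[termID] = offset;
    lastPositions[termID] = position;
  }

  private static int[] grow(int[] src, int newSize) {
    int[] dst = new int[newSize];
    System.arraycopy(src, 0, dst, 0, src.length);
    return dst;
  }
}
{code}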

bq.  so if eg we someday use packed ints we'd be more RAM efficient by storing 
termIDs...

How does the read performance of packed ints compare to "normal" int[] arrays?  
I think nowadays RAM is less of an issue?  And with a searchable RAM buffer we 
might want to sacrifice a bit more RAM for higher search performance?  Oh man, 
will we need flexible indexing for the in-memory index too? :) 

> Use parallel arrays instead of PostingList objects
> --
>
> Key: LUCENE-2329
> URL: https://issues.apache.org/jira/browse/LUCENE-2329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-lived PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling and will remove the overhead of object 
> initialization and garbage collection.  Garbage collection especially should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a large number of 
> long-lived objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though we can make this improvement in a separate JIRA issue.




[jira] Created: (LUCENE-2329) Use parallel arrays instead of PostingList objects

2010-03-17 Thread Michael Busch (JIRA)
Use parallel arrays instead of PostingList objects
--

 Key: LUCENE-2329
 URL: https://issues.apache.org/jira/browse/LUCENE-2329
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.

In order to avoid having very many long-lived PostingList objects in 
TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
simply be an int[] which maps each term to a dense termID.

All data that the PostingList classes currently hold will then be placed in 
parallel arrays, where the termID is the index into the arrays.  This will 
avoid the need for object pooling and will remove the overhead of object 
initialization and garbage collection.  Garbage collection especially should 
benefit significantly when the JVM runs out of memory, because in such a 
situation the gc mark times can get very long if there is a large number of 
long-lived objects in memory.

Another benefit could be to build more efficient TermVectors.  We could avoid 
the need to store the term string per document in the TermVector.  
Instead we could just store the segment-wide termIDs.  This would reduce the 
size and also make it easier to implement efficient algorithms that use 
TermVectors, because no term mapping across documents in a segment would be 
necessary.  Though we can make this improvement in a separate JIRA issue.
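
To illustrate the layout change, here is a minimal sketch (the field names 
loosely follow the PostingList members mentioned in the LUCENE-2324 discussion; 
the actual patch may differ):

{code}
// Today: one long-lived object per unique term.
class PostingListSketch {
  int docFreq;
  int lastDocID;
  int lastDocCode;
  int lastDocPosition;
}

// With this issue: a dense termID per unique term, and one parallel array
// per former field.  The termID is the index into every array, so no
// per-term object allocation, pooling, or GC tracking is needed.
class FreqProxParallelArraysSketch {
  final int[] docFreqs;
  final int[] lastDocIDs;
  final int[] lastDocCodes;
  final int[] lastDocPositions;

  FreqProxParallelArraysSketch(int size) {
    docFreqs = new int[size];
    lastDocIDs = new int[size];
    lastDocCodes = new int[size];
    lastDocPositions = new int[size];
  }

  int docFreq(int termID) {
    return docFreqs[termID];   // plain array lookup instead of an object dereference
  }
}
{code}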




[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-17 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: lucene-2324-no-pooling.patch

All tests pass, but I have to review whether the memory consumption calculation 
still works correctly with these changes. I'm not sure whether the JUnit tests 
cover that.

I also haven't done any performance testing yet.  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324-no-pooling.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-17 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846586#action_12846586
 ] 

Michael Busch commented on LUCENE-2324:
---

bq. Michael, Agreed, can you outline how you think we should proceed then?

Sorry for not responding earlier...

I'm currently working on removing the PostingList object pooling, because that 
makes TermsHash and TermsHashPerThread much simpler.  I have written the patch 
and all tests pass, though I haven't done performance testing yet.  Making 
TermsHash and TermsHashPerThread smaller will also make the patch here, which 
will remove them, easier. I'll post the patch soon. 

The next steps here, I think, are to make everything downstream of 
DocumentsWriter single-threaded (removal of the *PerThread classes).  Then we 
need to write the DocumentsWriterThreadBinder and think about how to apply 
deletes, commits and rollbacks to all DocumentsWriter instances.  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-16 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846128#action_12846128
 ] 

Michael Busch commented on LUCENE-2324:
---

I think we all agree that we want a single-writer-thread, multi-reader-thread 
model.  Only then can the thread-safety problems in LUCENE-2312 be reduced to 
visibility (no write-locking).  So I think making this change first makes the 
most sense.  It involves a bit of boring refactoring work, unfortunately. 

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-16 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846084#action_12846084
 ] 

Michael Busch commented on LUCENE-2324:
---

Shall we not first try to remove the downstream *PerThread classes and make the 
DocumentsWriter single-threaded, without locking?  Then we add a 
PerThreadDocumentsWriter and a DocumentsWriterThreadBinder; the 
DocumentsWriterThreadBinder talks to the PerThreadDWs, and IW talks to the 
DWTB.  We can pick other names :)

When that's done we can think about what kind of 
locking/synchronization/volatile stuff we need for LUCENE-2312.

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-16 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845978#action_12845978
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
think we simply need a way to publish byte arrays to all
threads? Michael B. can you post something of what you have so
we can get an idea of how your system will work (ie, mainly what
the assumptions are)?
{quote}

It's kinda complicated to explain and currently differs a lot from Lucene's 
TermsHash classes.  I'd prefer to wait a little bit until I have verified that 
my solution works.

I think here we should really tackle LUCENE-2324 first - it's a prereq.  Wanna 
help with that, Jason?

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-16 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845969#action_12845969
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
I thought we're moving away from byte block pooling and we're
going to try relying on garbage collection? Does a volatile
object[] publish changes to all threads? Probably not, again
it'd just be the pointer.
{quote}

So far we were only considering moving away from pooling of (Raw)PostingList 
objects.  Pooling byte blocks might have more of a performance impact - they're 
more heavyweight.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745
 ] 

Michael Busch commented on LUCENE-2312:
---

The tricky part is to make sure that a reader always sees a consistent snapshot 
of the index.  At the same time a reader must not follow pointers to 
non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs in certain 
intervals to not prevent JVM optimizations - but I need more time for thinking 
about all the combinations and corner cases.

It's getting late now - need to sleep!

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745
 ] 

Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:51 AM:


The tricky part is to make sure that a reader always sees a consistent snapshot 
of the index.  At the same time a reader must not follow pointers to 
non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs (i.e. does 
volatile writes) in certain intervals to not prevent JVM optimizations - but I 
need more time for thinking about all the combinations and corner cases.

It's getting late now - need to sleep!

  was (Author: michaelbusch):
The tricky part is to make sure that a reader always sees a consistent 
snapshot of the index.  At the same time a reader must not follow pointers to 
non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs in certain 
intervals to not prevent JVM optimizations - but I need more time for thinking 
about all the combinations and corner cases.

It's getting late now - need to sleep!
  
> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731
 ] 

Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:12 AM:


{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?

Do you mean this?
{code}
volatile byte[] array;
{code}

This makes the *reference* to the array volatile, not the slots in the array.

  was (Author: michaelbusch):
{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?
  
> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845726#action_12845726
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
A quick and easy way to solve this is to use a read write lock
on the byte pool?
{quote}

If you use a RW lock then the writer thread will block all reader threads while 
it's making changes.  In a real-time search environment the writer thread will 
be making changes all the time.  I'm sure the contention will kill performance: 
a RW lock is only faster than a mutual exclusion lock if writes are infrequent, 
as mentioned in the javadocs of ReadWriteLock.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845712#action_12845712
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote} Hmm... what does JMM say about byte arrays? If one thread is writing
to the byte array, can any other thread see those changes? 
{quote}

This is exactly the right question to ask here. Thread-safety is by far the
most complicated aspect of this feature. Jason, I'm not sure if you have
already figured out how to ensure visibility of changes made by the writer
thread to the reader threads?

Thread-safety in our case boils down to safe publication. We don't need
locking to coordinate writing of multiple threads, because of LUCENE-2324. But
we need to make sure that the reader threads see all changes they need to see
at the right time, in the right order. This is IMO very hard, but we all like
challenges :)

The JMM gives no guarantee whatsoever about which changes a thread will see
that another thread made - or whether it will ever see them - unless proper
publication is ensured by either synchronization or volatile/atomic variables.

So e.g. if a writer thread executes the following statements:
{code}
public static int a, b;

...

a = 1; b = 2;

a = 5; b = 6;
{code}

and a reader threads does:
{code}
System.out.println(a + "," + b);
{code}

The thing to remember is that the output might be: 1,6! Another reader thread
with the following code: 
{code}
while (b != 6) {
  .. do something 
}
{code}
might furthermore NEVER terminate without synchronization/volatile/atomic variables.

The reason is that the JVM is allowed to perform any reorderings to utilize
modern CPUs, memory, caches, etc. if not forced otherwise.

To ensure safe publication of data written by a thread we could use
synchronization, but my goal here is to implement a non-blocking and
lock-free algorithm. So my idea was to make use of a very subtle behavior
of volatile variables. I will take a simple explanation of the JMM from Brian
Goetz' awesome book "Java concurrency in practice", in which he describes the
JMM in simple happens-before rules. I will mention only three of those rules,
because they are enough to describe the volatile behavior I'd like to mention
here (p. 341)

*Program order rule:* Each action in a thread _happens-before_ every action in
that thread that comes later in the program order.

*Volatile variable rule:* A write to a volatile field _happens-before_ every
subsequent read of that same field.

*Transitivity:* If A happens-before B, and B _happens-before_ C, then A
_happens-before_ C.

Based on these three rules you can see that writing to a volatile variable v
by one thread t1 and subsequent reading of the same volatile variable v by
another thread t2 publishes ALL changes of t1 that happened-before the write
to v and the change of v itself. So this write/read of v means crossing a
memory barrier and forcing everything that t1 might have written to caches to
be flushed to the RAM. That's why a volatile write can actually be pretty
expensive.
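
A minimal sketch of how this publication pattern can be used (illustrative only, 
assuming a single writer thread; this is not the actual TermsHash code): the 
writer does many plain writes and then one volatile write, and a reader that 
reads the volatile field first is guaranteed to see everything written before it.

{code}
// Sketch: safe publication via a single volatile write (growth and bounds
// checks omitted for brevity).
class SafePublicationSketch {
  final int[] postings = new int[1024];  // plain, non-volatile data
  int count;                             // plain, non-volatile
  volatile int published;                // the publication point

  // writer thread
  void add(int posting) {
    postings[count] = posting;           // plain writes...
    count++;
    published = count;                   // ...made visible by one volatile write
  }

  // reader thread: everything written before the volatile write to
  // "published" is visible after the volatile read of "published"
  // (program order rule + volatile variable rule + transitivity).
  int sumVisible() {
    int upto = published;                // volatile read crosses the barrier
    int sum = 0;
    for (int i = 0; i < upto; i++) {
      sum += postings[i];
    }
    return sum;
  }
}
{code}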

Note that this behavior has only worked the way I just described since
Java 1.5. The behavior of volatile variables changed in a very subtle way from
1.4 to 1.5!

The way I'm trying to make use of this behavior is actually similar to how we
lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM
write-through as much as possible, which increases the probability of getting
the sync for free! I'm still fleshing out the details, but I wanted to share
this info with you guys already, because it might invalidate a lot of
assumptions you might have when developing the code. Some of this stuff was
actually new to me - maybe you all know it already.  And if anything I wrote
here is incorrect, please let me know!

Btw: IMO, if there's only one Java book you ever read, then read Goetz's
book! It's great. Somewhere in the book he also says about lock-free
algorithms: "Don't try this at home!" - so, let's do it! :)

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845703#action_12845703
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Sounds like awesome progress!! Want some details over here :)
{quote}

Sorry for not being very specific.  The prototype I'm experimenting with has a 
fixed-length postings format for the in-memory representation (in TermsHash).  
Basically every posting takes 4 bytes, so I can use int[] arrays (instead of 
the byte[] pools).  The first 3 bytes are used for an absolute docID (not 
delta-encoded). This limits the max in-memory segment size to 2^24 docs.  The 
one remaining byte is used for the position.  With a max doc length of 140 
characters you can fit every possible position in a byte - what a luxury! :)  
If a term occurs multiple times in the same doc, then the TermDocs just skips 
the additional occurrences with the same docID and increments the freq.  Again, 
the same term doesn't occur often in super-short docs.

The int[] slices also don't have forward pointers, like Lucene's TermsHash has, 
but backward pointers.  In real-time search you often want a strongly 
time-biased ranking.  A PostingList object has a pointer that points to the 
last posting (this statement is not 100% correct for visibility reasons across 
threads, but we can imagine it this way for now).  A TermDocs can now traverse 
the posting lists in the opposite order.  Skipping can be done by following 
pointers to previous slices directly, or by binary search within a slice.
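
For illustration, the fixed-length 4-byte posting described above could be 
packed into a single int roughly like this (a sketch with invented names, not 
the prototype code):

{code}
// Sketch: 3 bytes of absolute docID (max 2^24 docs per in-memory segment)
// plus 1 byte of position (enough for docs of at most 140 characters).
final class FixedLengthPostingSketch {
  static int encode(int docID, int position) {
    assert docID >= 0 && docID < (1 << 24);
    assert position >= 0 && position < (1 << 8);
    return (docID << 8) | position;
  }

  static int docID(int posting) {
    return posting >>> 8;       // upper 3 bytes
  }

  static int position(int posting) {
    return posting & 0xFF;      // lowest byte
  }
}
{code}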

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845400#action_12845400
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
Sounds great - let's test it in practice.
{quote}

I have to admit that I need to catch up a bit on the flex branch.  I was 
wondering if it makes sense to make these kinds of experiments (pooling vs. 
non-pooling) with the flex code? Is it as fast as trunk already, or are there 
related nocommits left that affect indexing performance?  I would think not 
much of the flex changes should affect the in-memory indexing performance (in 
TermsHash*).


> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Issue Comment Edited: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398
 ] 

Michael Busch edited comment on LUCENE-2324 at 3/15/10 4:34 PM:


Reply to Mike's comment on LUCENE-2293: 
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263


{quote}
I think we can do even better, ie, that class wastes RAM for the single posting 
case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are 
not needed).

EG we could have a separate class dedicated to the singleton case. When term is 
first encountered it's enrolled there. We'd probably need a separate hash to 
store these (though not necessarily?). If it's seen again it's switched to the 
full posting.
{quote}

Hmm I think we'd need a separate hash.  Otherwise you have to subclass 
PostingList for the different cases (freq. vs. non-frequent terms) and do 
instanceof checks? Or with the parallel arrays idea maybe we could encode more 
information in the dense ID? E.g. use one bit to indicate if that term occurred 
more than once. 

{quote}
I mean instead of allocating an instance per unique term, we assign an integer 
ID (dense, ie, 0, 1, 2...).

And then we have an array for each member now in 
FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. 
Then to look up say the lastDocID for a given postingID you just get 
lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we 
can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I 
had a very compact Node class with no data but such a dense ID, and arrays that 
stored the associated data.  Very easy to add another data type with no RAM 
overhead (you only use the amount of RAM the data needs).

Though, the price you pay is dereferencing multiple times, once per array?  
And how much RAM would we save? The pointer to the PostingList object (4-8 
bytes), plus the size of the object header - how much is that in Java? 

Seems like it's 8 bytes: 
http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes 
(ID) = 8 bytes per unique term.  For fields with tons of unique terms that 
might be worth it?  

  was (Author: michaelbusch):
{quote}
I think we can do even better, ie, that class wastes RAM for the single posting 
case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are 
not needed).

EG we could have a separate class dedicated to the singleton case. When term is 
first encountered it's enrolled there. We'd probably need a separate hash to 
store these (though not necessarily?). If it's seen again it's switched to the 
full posting.
{quote}

Hmm I think we'd need a separate hash.  Otherwise you have to subclass 
PostingList for the different cases (freq. vs. non-frequent terms) and do 
instanceof checks? Or with the parallel arrays idea maybe we could encode more 
information in the dense ID? E.g. use one bit to indicate if that term occurred 
more than once. 

{quote}
I mean instead of allocating an instance per unique term, we assign an integer 
ID (dense, ie, 0, 1, 2...).

And then we have an array for each member now in 
FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. 
Then to look up say the lastDocID for a given postingID you just get 
lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we 
can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I 
had a very compact Node class with no data but such a dense ID, and arrays that 
stored the associated data.  Very easy to add another data type with no RAM 
overhead (you only use the amount of RAM the data needs).

Though, the price you pay is dereferencing multiple times, once per array?  
And how much RAM would we save? The pointer to the PostingList object (4-8 
bytes), plus the size of the object header - how much is that in Java? 

Seems like it's 8 bytes: 
http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes 
(ID) = 8 bytes per unique term.  For fields with tons of unique terms that 
might be worth it?  
  
> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
I think we can do even better, ie, that class wastes RAM for the single posting 
case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are 
not needed).

EG we could have a separate class dedicated to the singleton case. When term is 
first encountered it's enrolled there. We'd probably need a separate hash to 
store these (though not necessarily?). If it's seen again it's switched to the 
full posting.
{quote}

Hmm I think we'd need a separate hash.  Otherwise you have to subclass 
PostingList for the different cases (freq. vs. non-frequent terms) and do 
instanceof checks? Or with the parallel arrays idea maybe we could encode more 
information in the dense ID? E.g. use one bit to indicate if that term occurred 
more than once. 

{quote}
I mean instead of allocating an instance per unique term, we assign an integer 
ID (dense, ie, 0, 1, 2...).

And then we have an array for each member now in 
FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. 
Then to look up say the lastDocID for a given postingID you just get 
lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we 
can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I 
had a very compact Node class with no data but such a dense ID, and arrays that 
stored the associated data.  Very easy to add another data type with no RAM 
overhead (you only use the amount of RAM the data needs).

Though, the price you pay is dereferencing multiple times, once per array?  
And how much RAM would we save? The pointer to the PostingList object (4-8 
bytes), plus the size of the object header - how much is that in Java? 

Seems like it's 8 bytes: 
http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes 
(ID) = 8 bytes per unique term.  For fields with tons of unique terms that 
might be worth it?  
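
If we go with the parallel arrays, the "one bit in the dense ID" idea above 
could look roughly like this (purely illustrative; nothing like this exists in 
the patch yet):

{code}
// Sketch: encode a "seen more than once" flag in the dense termID itself.
// The top bit marks a term that has occurred at least twice; the remaining
// 31 bits are the index into the parallel arrays.
final class TermIDFlagSketch {
  static final int MULTI_OCCURRENCE_BIT = 1 << 31;

  static int markMultiOccurrence(int termID) {
    return termID | MULTI_OCCURRENCE_BIT;
  }

  static boolean isSingleton(int termID) {
    return (termID & MULTI_OCCURRENCE_BIT) == 0;
  }

  static int arrayIndex(int termID) {
    return termID & ~MULTI_OCCURRENCE_BIT;   // strip the flag bit
  }
}
{code}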

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-15 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845391#action_12845391
 ] 

Michael Busch commented on LUCENE-2293:
---

I'll reply on LUCENE-2324.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which I probably should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845199#action_12845199
 ] 

Michael Busch commented on LUCENE-2324:
---

Here is an interesting article about allocation/deallocation on modern JVMs:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html

And here is a snippet that mentions how pooling is generally not faster anymore:


{quote}
Allocation in JVMs was not always so fast -- early JVMs indeed had poor 
allocation and garbage collection performance, which is almost certainly where 
this myth got started. In the very early days, we saw a lot of "allocation is 
slow" advice -- because it was, along with everything else in early JVMs -- and 
performance gurus advocated various tricks to avoid allocation, such as object 
pooling. (Public service announcement: Object pooling is now a serious 
performance loss for all but the most heavyweight of objects, and even then it 
is tricky to get right without introducing concurrency bottlenecks.) However, a 
lot has happened since the JDK 1.0 days; the introduction of generational 
collectors in JDK 1.2 has enabled a much simpler approach to allocation, 
greatly improving performance. 
{quote}




> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845190#action_12845190
 ] 

Michael Busch commented on LUCENE-2293:
---

OK I opened LUCENE-2324.  We can close this one after you committed your patch, 
Mike.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Created: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-14 Thread Michael Busch (JIRA)
Per thread DocumentsWriters that write their own private segments
-

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


See LUCENE-2293 for motivation and more details.

I'm copying here Mike's summary he posted on 2293:

Change the approach for how we buffer in RAM to a more isolated
approach, whereby IW has N fully independent RAM segments
in-process and when a doc needs to be indexed it's added to one of
them. Each segment would also write its own doc stores and
"normal" segment merging (not the inefficient merge we now do on
flush) would merge them. This should be a good simplification in
the chain (eg maybe we can remove the *PerThread classes). The
segments can flush independently, letting us make much better
concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845159#action_12845159
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. How about a new issue?

OK, will open one.

bq. (if Zipf's law is applying, half the terms should be singletons; if it's 
not, you could have many more singleton terms...)

Yeah, we should utilize our knowledge of term distribution to optimize the 
in-memory postings.  For example, a nice optimization would be to store the 
first posting directly in the PostingList object and only allocate slices once 
you see the second occurrence (similar to the pulsing codec)?

bq.  Though... to reduce our per-unique-term RAM cost, we may want to move away 
from separate postings object per term to parallel arrays.

What exactly do you mean by parallel arrays? Parallel to the termsHash array?  
Then the termsHash array would not be an array of PostingList objects anymore, 
but an array of pointers into the char[] array?  And you'd have e.g. a parallel 
int[] array for df, another int[] for pointers into the postings byte pool, 
etc.? Something like that?
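
Roughly something like this, maybe (just a sketch to make sure I picture the 
same thing; the field names and grow policy are made up, this is not actual 
Lucene code):

{code}
// Hypothetical sketch: one slot per unique term instead of one PostingList
// object per term.  The termsHash would then store the slot number ("term id")
// instead of an object reference.
class ParallelPostingsArray {
  int[] textStart;      // offset of the term's chars in the shared char[] pool
  int[] docFreq;        // document frequency seen so far
  int[] byteStart;      // offset of the term's first slice in the postings byte pool
  int[] lastDocID;      // last docID that hit this term (delta-encoding state)
  int size;             // number of slots in use

  ParallelPostingsArray(int initialCapacity) {
    textStart = new int[initialCapacity];
    docFreq   = new int[initialCapacity];
    byteStart = new int[initialCapacity];
    lastDocID = new int[initialCapacity];
  }

  // Returns the id (slot) of the newly added term.
  int addTerm(int charPoolOffset) {
    if (size == textStart.length) {
      grow(Math.max(8, size * 2));
    }
    textStart[size] = charPoolOffset;
    docFreq[size]   = 0;
    byteStart[size] = -1;   // no postings slice allocated yet
    lastDocID[size] = -1;
    return size++;
  }

  private void grow(int newCapacity) {
    textStart = java.util.Arrays.copyOf(textStart, newCapacity);
    docFreq   = java.util.Arrays.copyOf(docFreq, newCapacity);
    byteStart = java.util.Arrays.copyOf(byteStart, newCapacity);
    lastDocID = java.util.Arrays.copyOf(lastDocID, newCapacity);
  }
}
{code}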

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845157#action_12845157
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Michael are you also going to [first] tackle truly separating the RAM segments? 
I think we need this first ...
{quote}

Yeah I agree.  I started working on a patch for separating the doc writers 
already.

I also have a separate indexing chain prototype working with a searchable RAM 
buffer (single-threaded), but with a slightly different postinglist format 
(some docs nowadays only have 140 characters ;) ). It seems really fast.  I 
spent a long time thinking about lock-free algorithms and data structures, so 
indexing performance should be completely independent of the search load (in 
theory).  I need to think a bit more about how to make it work with "normal" 
documents and Lucene's current in-memory format.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845155#action_12845155
 ] 

Michael Busch commented on LUCENE-2312:
---

Well, we need to keep our transactional semantics. So I assume that while a 
flush will happen per doc writer independently, a commit will trigger all 
(per-thread) doc writers to flush. A rollback then also has to abort all 
per-thread doc writers.
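
Roughly like this, as a minimal sketch (the per-thread writer interface and 
its flush/abort methods are made up here, not the actual API):

{code}
// Sketch only: commit() must flush every per-thread doc writer, rollback()
// must abort them all; an individual writer may still flush on its own when
// its private buffer gets too large.
class TransactionalSemanticsSketch {

  interface DocWriterPerThread {
    long ramBytesUsed();
    void flush() throws java.io.IOException;   // writes a private segment
    void abort();                              // discards buffered docs
  }

  private final java.util.List<DocWriterPerThread> writers =
      new java.util.concurrent.CopyOnWriteArrayList<DocWriterPerThread>();
  private final long perWriterRamBudget = 16 * 1024 * 1024;

  // Independent flush: only this writer pauses, the others keep indexing.
  void maybeFlush(DocWriterPerThread w) throws java.io.IOException {
    if (w.ramBytesUsed() > perWriterRamBudget) {
      w.flush();
    }
  }

  // Commit: all per-thread writers flush so the commit point sees all docs.
  synchronized void commit() throws java.io.IOException {
    for (DocWriterPerThread w : writers) {
      w.flush();
    }
    // writing/fsyncing the new segments file would follow here
  }

  // Rollback: all per-thread writers drop their buffered docs.
  synchronized void rollback() {
    for (DocWriterPerThread w : writers) {
      w.abort();
    }
  }
}
{code}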

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845048#action_12845048
 ] 

Michael Busch commented on LUCENE-2293:
---

I'm tempted to get rid of the pooling for PostingList objects.  The objects are 
very small, and since 1.5 Java does a good job with object creation and gc.  I 
even read that the JVM guys think that pooling can be slower than not pooling.

Also, I've mostly seen gc performance problems so far when there was a large 
number of long-lived objects - it makes the mark time of the garbage 
collection very long.  Pooling of course gets you into exactly that situation.

So what do you think about removing the pooling of the PostingList objects?
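
To illustrate what I mean (simplified, hypothetical classes - not the actual 
DocumentsWriter code):

{code}
// A tiny per-term object like this is exactly the kind of thing modern
// generational collectors handle well when it is simply allocated per term...
class PostingList {
  int docFreq;
  int lastDocID;
  int byteStart;
}

class TermsHashSketch {
  // ...whereas pooling keeps the objects alive forever, so every mark phase
  // has to walk them all.
  private final java.util.ArrayDeque<PostingList> pool =
      new java.util.ArrayDeque<PostingList>();

  PostingList newPostingListPooled() {
    PostingList p = pool.poll();
    return p != null ? p : new PostingList();
  }

  void recycle(PostingList p) {
    pool.push(p);
  }

  // Without pooling we just allocate; the object dies young and is cheap
  // to collect.
  PostingList newPostingList() {
    return new PostingList();
  }
}
{code}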

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845047#action_12845047
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
but does anyone out there wanna work out the "private RAM segments"?
{quote}

Shall we use this issue for the private RAM segments? Or do you want to commit 
the simple patch, close this one and open a new issue?

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2312:
--

Fix Version/s: (was: 3.0.2)
   3.1

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845032#action_12845032
 ] 

Michael Busch commented on LUCENE-2312:
---

I'll try to tackle this one!

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.0.2
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.



[jira] Assigned: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-2312:
-

Assignee: Michael Busch

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.0.2
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845031#action_12845031
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Also, we could store the first docID stored into the term, too - this
way we could have a ordered collection of terms, that's shared across
several open readers even as changes are still being made, but each
reader skips a given term if its first docID is greater than the
maxDoc it's searching. That'd give us point in time searching even
while we add terms with time...
{quote}

Exactly. This is what I meant in my comment: 
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

But I mistakenly said lastDocID; of course firstDocID is correct.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
> Fix For: 3.0.2
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)

2010-03-07 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842502#action_12842502
 ] 

Michael Busch commented on LUCENE-2302:
---

Hmm maybe this is too much magic? Wouldn't it be simpler to have two completely 
separate attributes? E.g. CharTermAttribute and ByteTermAttribute. Plus an API 
in the indexer that specifies which one to use? 
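
Roughly something like this (just a sketch of the two-attribute alternative, 
not a worked-out API; the interface shapes are hypothetical):

{code}
// Two fully separate attributes instead of one "magic" implementation class.
interface CharTermAttributeSketch
    extends org.apache.lucene.util.Attribute, CharSequence, Appendable {
  char[] buffer();                 // resizable term char buffer
  void setLength(int length);
}

interface ByteTermAttributeSketch extends org.apache.lucene.util.Attribute {
  org.apache.lucene.util.BytesRef getBytesRef();  // the indexer consumes raw bytes
}
{code}

The indexer would then have to be told (or detect via hasAttribute()) which of 
the two a given TokenStream provides.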

> Replacement for TermAttribute+Impl with extended capabilities (byte[] 
> support, CharSequence, Appendable)
> 
>
> Key: LUCENE-2302
> URL: https://issues.apache.org/jira/browse/LUCENE-2302
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
> Fix For: Flex Branch
>
>
> For flexible indexing terms can be simple byte[] arrays, while the current 
> TermAttribute only supports char[]. This is fine for plain text, but e.g 
> NumericTokenStream should directly work on the byte[] array.
> Also TermAttribute lacks of some interfaces that would make it simplier for 
> users to work with them: Appendable and CharSequence
> I propose to create a new interface "CharTermAttribute" with a clean new API 
> that concentrates on CharSequence and Appendable.
> The implementation class will simply support the old and new interface 
> working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of 
> this. So if somebody adds a TermAttribute, he will get an implementation 
> class that can be also used as CharTermAttribute. As both attributes create 
> the same impl instance both calls to addAttribute are equal. So a TokenFilter 
> that adds CharTermAttribute to the source will work with the same instance as 
> the Tokenizer that requested the (deprecated) TermAttribute.
> To also support byte[] only terms like Collation or NumericField needs, a 
> separate getter-only interface will be added, that returns a reusable 
> BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will 
> also support this interface. For backwards compatibility with old 
> self-made-TermAttribute implementations, the indexer will check with 
> hasAttribute(), if the BytesRef getter interface is there and if not will 
> wrap a old-style TermAttribute (a deprecated wrapper class will be provided): 
> new BytesRefGetterAttributeWrapper(TermAttribute), that is used by the 
> indexer then.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841923#action_12841923
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. So about the int[], would that be of the size of the index (flushed and 
unflushed) segments? Suppose that:

Each DW would have its own int[]. The size would correspond to the number of 
docs the DW has in its buffer.

{quote}
I've indexed 5 documents, flushed. (IDs 0-4)
Indexed 2 on DW1. (IDs 0,1)
Indexed 2 on DW2. (IDs 0,1)
Delete by term which affects: flushed IDs 1, 4, DW1-0, DW2 - 0, 1
Would the int[] be of size 9, and the deleted IDs be 1, 4, 5, 7, 8? How would 
DW1- be mapped to 5, and DW2-0,1 be mapped to 7 and 8? Will the int[] be 
initially of size 5 and after DW1 flushes expand to 7, and ID=5 will be set 
(and afterwards expand to 9 with IDs 7,8)? If so then I understand.
{quote}

DW1 will have an int[] of size 2, and DW2 will also have a separate int[] of 
size 2.

I think you were thinking of one big int[] across the entire index? I believe 
the whole approach becomes clear once you think of the int[]s as being per 
RAM segment.
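
As a sketch (hypothetical names, just to illustrate the per-DW numbering):

{code}
// Each DW numbers its buffered docs locally, starting at 0, and keeps its own
// small deletes array sized to its buffer - there is no index-wide array.
class DocumentsWriterSketch {
  private int[] deletedAtSeq = new int[8]; // slot = local docID inside this DW's buffer
  private int bufferedDocs;

  // Called for every added document; returns the doc's local id.
  int nextLocalDocID() {
    if (bufferedDocs == deletedAtSeq.length) {
      deletedAtSeq = java.util.Arrays.copyOf(deletedAtSeq, bufferedDocs * 2);
    }
    return bufferedDocs++;
  }

  // One possible encoding: remember the sequence id at which the doc was
  // deleted (0 = live), so point-in-time readers can ignore later deletes.
  void markDeleted(int localDocID, int seqID) {
    deletedAtSeq[localDocID] = seqID;
  }

  int numBufferedDocs() {
    return bufferedDocs;
  }
}
{code}

In your example above, DW1 and DW2 would each have such an array covering their 
two locally numbered docs (0 and 1); deletes against the already flushed 
segment are handled separately.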

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841915#action_12841915
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
This is a great approach for speeding up NRT - NRT readers will no longer have 
to flush. It's similar in spirit to LUCENE-1313, but that issue is still 
flushing segments (but, into an intermediate RAMDir).
{quote}

I agree! Thinking further about this: Each (re)opened RAM segment reader needs 
to also remember the maxDoc of the corresponding DW at the time it was 
(re)opened. This way we can prevent a RAM reader to read postinglists beyond 
that maxDoc, even if the writer thread keeps building the lists in parallel. 
This allows us to guarantee the point-in-time requirements.

Also, the PostingList objects we store in the TermHash already contain a 
lastDocID (if I remember correctly). So when a RAM reader termEnum iterates the 
dictionary it can skip all terms where term.lastDocID > RAMReader.maxDoc.

It's quite neat that all we have to do in reopen then is to update 
ramReader.maxDoc and ramReader.seqID.

Of course one big thing is still missing: keeping the term dictionary sorted. 
In order to implement the full IndexReader interface, specifically TermEnum, 
it's necessary to give each RAM reader a point-in-time sorted dictionary. At 
least in one direction, as a TermEnum only seeks forward.

I think we have two options here: Either we try to keep the dictionary always 
sorted, whenever a term is added. I guess then we'd have to implement a b-tree 
or something similar?

The second option I can think of is to add a "nextTerm" pointer to 
TermHash.PostingList. This allows us to build up a linked list across all 
terms. When a ramReader is opened we would sort all terms, but not by changing 
their position in the hash - instead by building the singly-linked list in 
sorted order.

When a new reader gets (re)opened we need to mergesort the new terms into the 
linked list. I guess it's easy to get this implemented lock-free. E.g. if you 
have the linked list a->c and you want to add b in the middle, you set b->c 
before changing a->c to a->b. Then it's undefined whether an in-flight older 
reader would see term b. The old reader must not return b, since b was added 
after the old reader was (re)opened. So either case is fine: either it doesn't 
see b because the link wasn't updated yet, or it sees it but doesn't return it, 
because b.lastDocID > ramReader.maxDoc.

The downside is that we will have to pay the price of sorting in reader.reopen, 
which however should be cheap if readers are reopened frequently. Not sure 
though whether this linked-list approach is more or less compelling than 
something like a b-tree?

Btw: Shall we open a new "searchable DW buffer" issue or continue using this 
issue for these discussions?
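
Here's a toy sketch of that linked-list idea (hypothetical classes; it uses a 
firstDocID for the visibility check, which - as corrected elsewhere in this 
thread - is the value that really needs to be compared against the reader's 
maxDoc):

{code}
// Toy sketch of the sorted singly-linked term list.  The single writer thread
// links the new node's next pointer first, then publishes it from the
// predecessor; an older reader either misses the new term or filters it out
// via the docID check.
class TermNode {
  final String text;
  volatile int firstDocID = Integer.MAX_VALUE; // docID of the term's first posting
  volatile TermNode next;

  TermNode(String text) {
    this.text = text;
  }
}

class SortedTermListSketch {
  // Insert b between a and c (a.next == c): set b->c before a->b.
  static void insertAfter(TermNode a, TermNode b) {
    b.next = a.next;  // b -> c
    a.next = b;       // publish; single writer thread, so no CAS is needed
  }

  // A point-in-time reader skips terms first seen at or beyond its maxDoc.
  static boolean isVisible(TermNode term, int readerMaxDoc) {
    return term.firstDocID < readerMaxDoc;
  }
}
{code}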

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run

[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841745#action_12841745
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
Won't this complicate the entire solution? What I liked about keeping each DW 
separate (and call it SegmentWriter) is that it really operates on its own. 
When a delete happens on IW, it is synced so that it could be registered on all 
DWs. But besides that, the DWs don't know about each other nor care. Code 
should be really simple that way - the only thing that will be shared is the 
pool of buffers.
{quote}

What I'm proposing is no different, nor does it make things more complicated. 
Either way, you have to apply all deletes on all DWs, because you delete by 
query or term.

This might not be the right time for this proposal, because it'll only work 
with searchable DW buffers. But I wanted to mention this idea already, so that 
we can keep it in mind. And hopefully we can work on searchable DW buffers soon.

{quote}
but does anyone out there wanna work out the "private RAM segments"?
{quote}

I would like to try to help, but I'm likely not going to have enough time right 
now to write an entire patch for this big change myself.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-05 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841744#action_12841744
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
But if each DW maintains its own doc IDs, separately from the others, what will 
be stored in the int[]? DW1 deleted docID 0 (its 0) and DW4 deleted the same. 
The two documents are not the same one ... no? 
{quote}

In DW you don't delete by docID. You can only delete by term or query. You have 
to run the (term)query in all DWs to determine if any of the DWs have one or 
more matching docs that have to be deleted.

Today the queries and/or terms are buffered, along with the maxDocID at the 
time the delete or update was called. They are applied just after the DW buffer 
was flushed to a segment, because that's the first time the docs are 
searchable and the delete queries can be executed.

In the future, when we can search the DW buffer(s), you can apply the deletes 
right away. Using this int[] approach for deletes avoids the need to clone 
bitsets on each reopen.
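
As a rough sketch of that buffering (hypothetical helper, not the actual 
DocumentsWriter code):

{code}
// Buffered delete-by-term as described above: remember, per term, the highest
// docID the delete may affect; the terms are only run as queries against the
// segment once the buffer has been flushed and is searchable.
class BufferedDeletesSketch {
  private final java.util.Map<org.apache.lucene.index.Term, Integer> terms =
      new java.util.HashMap<org.apache.lucene.index.Term, Integer>();

  void bufferDeleteTerm(org.apache.lucene.index.Term term, int maxDocIDAtDeleteTime) {
    Integer prev = terms.get(term);
    if (prev == null || prev.intValue() < maxDocIDAtDeleteTime) {
      terms.put(term, maxDocIDAtDeleteTime);
    }
  }

  // Called right after the buffer was flushed to a segment: only docs with
  // docID <= the remembered limit may be deleted.
  java.util.Map<org.apache.lucene.index.Term, Integer> drainForFlushedSegment() {
    java.util.Map<org.apache.lucene.index.Term, Integer> snapshot =
        new java.util.HashMap<org.apache.lucene.index.Term, Integer>(terms);
    terms.clear();
    return snapshot;
  }
}
{code}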

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841617#action_12841617
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. The big advantage is that all (re)opened readers can share the single int[] 
array.

Dirty reads will be a problem when sharing the array. An AtomicIntegerArray 
could be used; we'd need to experiment to see how expensive that would be.
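
Something along these lines, as a sketch (the seqID encoding is just one 
possibility, not committed code):

{code}
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: a deletes array shared by the writer and all (re)opened readers.
// AtomicIntegerArray gives visibility guarantees without cloning anything per
// reader; slot i holds the sequence id at which doc i was deleted, 0 = live.
class SharedDeletesSketch {
  private final AtomicIntegerArray deletedAtSeq;

  SharedDeletesSketch(int maxBufferedDocs) {
    deletedAtSeq = new AtomicIntegerArray(maxBufferedDocs);
  }

  // Writer thread marks a delete.
  void delete(int docID, int seqID) {
    deletedAtSeq.set(docID, seqID);
  }

  // Any reader: a doc is deleted for this reader only if the delete happened
  // at or before the reader's own seqID - so older readers keep their
  // point-in-time view without a private copy of the array.
  boolean isDeleted(int docID, int readerSeqID) {
    int seq = deletedAtSeq.get(docID);
    return seq != 0 && seq <= readerSeqID;
  }
}
{code}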

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841545#action_12841545
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
I thought that when (3) happens, the delete-by-term needs to be issued against 
all DWs, so that later when they apply their deletes they'll remember to do so. 
Issuing that against all DWs will record the docID of each DW up until which 
the delete should apply.
{quote}

Yes, you still need to apply deletes on all DWs. My approach is not different 
in that regard.

{quote}
Also, I don't see the advantage of moving to store the deletes in int[] rather 
than bitset ... is it just to avoid calling the get(doc)?
{quote}

The big advantage is that all (re)opened readers can share the single int[] 
array. If you use a bitset you need to clone it for each reader. With the 
int[], reopening becomes basically free from a deletes perspective.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841407#action_12841407
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. Yes, I think each DW will have to record its own buffered delete 
Term/Query, mapping to its docID at the time the delete arrived. 

I think in the future deletes in DW could work like this:
- DW of course keeps track of a private sequence id, which gets incremented in 
the add, delete, and update calls
- a DW has a getReader() call; the reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current 
seqID at the time it was opened - let's call it RAMReader.seqID; if such a 
reader gets reopened, simply its seqID gets updated.
- we keep a growing int array of size DW's maxDoc, which replaces the 
usual deletes bitset
- when DW.updateDocument() or .deleteDocument() needs to delete a doc, we do 
that right away, before inverting the new doc. We can do that by running the 
query with a RAMReader to find all docs that must be deleted. Instead of 
flipping a bit in a bitset, for each hit we now keep track of when it was 
deleted:

{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
...
Arrays.fill(deletes, NOT_DELETED);

...

public void deleteDocument(Query q) {
  // reopen the RAMReader and run query q against it
  for (each hit) {
    int hitDocId = ...;
    if (deletes[hitDocId] == NOT_DELETED) {
      deletes[hitDocId] = DW.seqID;
    }
  }
  ...
  DW.seqID++;
}
{code}

Now, no matter how often you (re)open RAMReaders, they can share the deletes 
array. No cloning is necessary, unlike with the BitSet approach.

When the RAMReader iterates posting lists, treating deleted docs correctly is 
as simple as this. Instead of doing this in RAMTermDocs.next():
{code}
  if (deletedDocsBitSet.get(doc)) {
    // skip this doc
  }
{code}

we can now do:

{code}
  if (deletes[doc] < ramReader.seqID) {
    // skip this doc
  }
{code}

Here is an example:
1. Add 3 docs with DW.addDocument() 
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b


After 1: DW.seqID = 2; deletes[]={NOT_DELETED, NOT_DELETED, NOT_DELETED}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[]={NOT_DELETED, 2, NOT_DELETED}
After 4: ramReader_b.seqID = 3

Note that both ramReader_a and ramReader_b share the same deletes[] array. Now 
when ramReader_a is used to read posting lists, it will not treat doc 1 as 
deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false; but 
ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) = 
(2 < 3) = true.
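
For illustration, here is a compilable, self-contained sketch of the shared 
deletes[]/seqID bookkeeping described above. SeqIdDeletesSketch and ReaderView 
are made-up names (not Lucene classes), and the absolute seqID values differ 
slightly from the walkthrough (here every operation consumes one seqID 
starting at 0), but the visibility rule is the same:

{code}
import java.util.Arrays;

public class SeqIdDeletesSketch {

  static final int NOT_DELETED = Integer.MAX_VALUE;

  private final int[] deletes;   // shared by every reader view, never cloned
  private int seqId;             // incremented by every add/delete/update

  SeqIdDeletesSketch(int maxDoc) {
    deletes = new int[maxDoc];
    Arrays.fill(deletes, NOT_DELETED);
  }

  void docAdded() {              // an add consumes one seqID
    seqId++;
  }

  void delete(int docId) {       // record the seqID at which the doc was deleted
    if (deletes[docId] == NOT_DELETED) {
      deletes[docId] = seqId;
    }
    seqId++;
  }

  ReaderView openReader() {      // a (re)opened reader only remembers the current seqID
    return new ReaderView(deletes, seqId);
  }

  static final class ReaderView {
    private final int[] deletes; // same array instance as the writer's
    private final int seqId;

    ReaderView(int[] deletes, int seqId) {
      this.deletes = deletes;
      this.seqId = seqId;
    }

    boolean isDeleted(int docId) {
      // deleted for this view only if the delete happened before it was opened
      return deletes[docId] < seqId;
    }
  }

  public static void main(String[] args) {
    SeqIdDeletesSketch dw = new SeqIdDeletesSketch(3);
    dw.docAdded(); dw.docAdded(); dw.docAdded();   // docs 0..2
    ReaderView a = dw.openReader();                // opened before the delete
    dw.delete(1);
    ReaderView b = dw.openReader();                // opened after the delete
    System.out.println(a.isDeleted(1));            // false: not visible to the older view
    System.out.println(b.isDeleted(1));            // true: visible to the newer view
  }
}
{code}

Both views reference the very same deletes array; only the seqID captured at 
open time differs, which is what makes reopening essentially free.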

What do you think about this approach for the future when we have a searchable 
DW buffer?

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).

[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841388#action_12841388
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
But, I was proposing a bigger change (call it "private RAM segments"):
there would be multiple DWs, each one writing to its own private RAM
segment (each one getting private docID assignment) and its own doc
stores.
{quote}

Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I 
like it.



> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841135#action_12841135
 ] 

Michael Busch commented on LUCENE-2293:
---

Sorry - after reading my comment again I can see why it was confusing. 
Loadbalancer wasn't a very good analogy.

I totally agree that Lucene should still piggyback on the application's threads 
and not start its own thread for document inversion.

Today, as you said, the DocumentsWriter manages a certain number of thread 
states, has the WaitQueue, and does its own memory management.

What I was thinking was that it would be simpler if the DocumentsWriter was 
only used by a single thread. The IndexWriter would have multiple 
DocumentsWriters and do the thread binding (+waitqueue). This would make the 
code in DocumentsWriter and the downstream classes simpler. The side-effect is 
that each DocumentsWriter would manage its own memory. 

{quote}
Also, I thought that each thread writes to different ThreadState does not 
ensure documents are written in order, but that finally when DW flushes, the 
different ThreadStates are merged together and one segment is written, somehow 
restores the orderness ...
{quote}

Stored fields are written to an on-disk stream (docstore) in order. The 
WaitQueue takes care of finishing the docs in the right order. 
The postings are written into a TermsHash per thread state in parallel. The doc 
IDs are in increasing order, but can have gaps: e.g. thread state 1 inverts 
docs 1 and 3, thread state 2 inverts doc 2. When it's time to flush the whole 
buffer, these per-thread-state posting lists get interleaved.
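
To make the ordering part concrete, here is a toy, standalone sketch of the 
general "finish docs in docID order" idea - not Lucene's actual WaitQueue; 
OrderedFinisher and writeStoredFields are made-up names:

{code}
import java.util.PriorityQueue;

// Threads may finish inverting documents out of order; stored fields must
// still be appended to the docstore strictly in docID order.
public class OrderedFinisher {

  private final PriorityQueue<Integer> pending = new PriorityQueue<>();
  private int nextDocId = 0;

  // called by whichever thread just finished inverting docId
  synchronized void finished(int docId) {
    pending.add(docId);
    // drain every doc whose turn has come, in increasing docID order
    while (!pending.isEmpty() && pending.peek().intValue() == nextDocId) {
      writeStoredFields(pending.poll());
      nextDocId++;
    }
  }

  private void writeStoredFields(int docId) {
    // stand-in for appending the doc's stored fields to the on-disk docstore
    System.out.println("docstore <- doc " + docId);
  }

  public static void main(String[] args) {
    OrderedFinisher q = new OrderedFinisher();
    q.finished(1);   // doc 1 finishes first, but must wait for doc 0
    q.finished(0);   // now docs 0 and 1 are written, in order
    q.finished(2);
  }
}
{code}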

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-04 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841120#action_12841120
 ] 

Michael Busch commented on LUCENE-2293:
---

{quote}
Also, in the pull approach, Lucene would introduce another place where it 
allocates threads.
{quote}

What I described is not much different from what's happening today. 
DocumentsWriter already has a WaitQueue that ensures the docs are written 
in the right order.

I simply tried to suggest a way to refactor our classes... functionally the 
same as what Mike suggested. I shouldn't have said "pulled from" (the queue).

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840952#action_12840952
 ] 

Michael Busch commented on LUCENE-2293:
---

bq. I hope we won't lose monotonic docIDs for a singlethreaded indexation 
somewhere along that path.

No. The order in the single-threaded case won't be different from today with 
the changes Mike is proposing.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840911#action_12840911
 ] 

Michael Busch commented on LUCENE-2293:
---

Good timing - a couple days ago I was thinking about how threading could be 
changed in the indexer.

The other downside is that you would have to buffer deleted docs and queries 
separately for each thread state, because you have to keep the private docIDs? 
So that would need a bit more memory.

Couldn't we then make the DocumentsWriter and all related downstream classes 
single-threaded? The IndexWriter (or a new class) would have the doc 
queue, basically a load balancer, that multiple DocumentsWriter instances would 
pull from as soon as they are done inverting the previous document?

This would allow us to simplify the indexer chain a lot - we could get rid of 
all the *PerThread classes. We'd also have to then separate the docstores from 
the DocumentsWriter, so that multiple DocumentsWriter instances could share 
them (which I'd like to do for LUCENE-2026 anyway).
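
A rough, standalone sketch of that idea - WriterPool and SingleThreadedWriter 
are made-up names, and the real thread binding and wait-queue details are 
omitted:

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Each writer is only ever used by one thread at a time, so it needs no
// internal thread states; the pool does the "load balancing".
class SingleThreadedWriter {
  void addDocument(String doc) {
    // invert into this writer's private buffer
  }
}

class WriterPool {
  private final BlockingQueue<SingleThreadedWriter> idle;

  WriterPool(int size) {
    idle = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      idle.add(new SingleThreadedWriter());
    }
  }

  // called by application threads; the library itself starts no threads
  void addDocument(String doc) throws InterruptedException {
    SingleThreadedWriter w = idle.take();   // check a writer out
    try {
      w.addDocument(doc);
    } finally {
      idle.put(w);                          // hand it back for the next doc
    }
  }
}
{code}

Application threads still drive all the work; the pool only decides which 
private writer a given document lands in.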

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2010-02-24 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2126:
--

Attachment: lucene-2126.patch

Updated patch to trunk.

I'll have to make a change to the backwards-tests too, because moving the 
copyBytes() method from IndexOutput to DataOutput and changing its parameter 
from IndexInput to DataInput breaks drop-in compatibility. 


> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch, lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2010-01-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795964#action_12795964
 ] 

Michael Busch commented on LUCENE-2126:
---

There has been silence here, so I hope everyone is ok with this change now?

I'll commit this in a day or two if nobody objects!

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-01-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795963#action_12795963
 ] 

Michael Busch commented on LUCENE-2186:
---

Great to see progress here, Mike!

{quote}
String fields are stored as the UTF8 byte[]. This patch adds a
BytesRef, which does the same thing as flex's TermRef (we should merge
them).
{quote}

It looks like BytesRef is very similar to Payload? Could you use that instead 
and extend it with the new String constructor and compare methods? 

{quote}
It handles 3 types of values:
{quote}

So it looks like with your approach you want to support certain
"primitive" types out of the box, such as byte[], float, int, and String?
If someone has custom data types, then they have, as with payloads
today, the byte[] indirection? 

The code I initially wrote for 1231 exposed IndexOutput, so that one
can call write*() directly, without having to convert to byte[]
first. I think we will also want to do that for 2125 (store attributes
in the index). So I'm wondering if this and 2125 should work
similarly? 
Thinking out loud: Could we then have attributes with
serialize/deserialize methods for primitive types, such as float?
Could we efficiently use such an approach all the way up to
FieldCache? It would be compelling if you could store an attribute as
CSF, or in the posting list, retrieve it from the flex APIs, and also
from the FieldCache. All would be the same API and there would only be
one place that needs to "know" about the encoding (the attribute).

{quote}
Next step is to do basic integration with Lucene, and then compare
sort performance of this vs field cache.
{quote}

Yeah, that's where I got kind of stuck with 1231: We need to figure
out what the public API should look like, i.e. how a user adds CSF
values to the index and retrieves them. The easiest and fastest way
would be to add a dedicated new API. The cleaner one would be to make the whole
Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.

{quote}
There are abstract Writer/Reader classes. The current reader impls
are entirely RAM resident (like field cache), but the API is (I think)
agnostic, ie, one could make an MMAP impl instead.

I think this is the first baby step towards LUCENE-1231. Ie, it
cannot yet update values, and the reading API is fully random-access
by docID (like field cache), not like a posting list, though I
do think we should add an iterator() api (to return flex's DocsEnum)
{quote}

Hmm, so random-access would obviously be the preferred approach for SSDs, but
with conventional disks I think the performance would be poor? In 1231
I implemented the var-sized CSF with a skip list, similar to a posting
list. I think we should add that here too and we can still keep the
additional index that stores the pointers? We could have two readers:
one that allows random-access and loads the pointers into RAM (or uses
MMAP as you mentioned), and a second one that doesn't load anything
into RAM, uses the skip lists and only allows iterator-based access?

About updating CSF: I hope we can use parallel indexing for that. In
other words: It should be possible for users to use parallel indexes
to update certain fields, and Lucene should use the same approach
internally to store different "generations" of things like norms and CSFs.

> First cut at column-stride fields (index values storage)
> 
>
> Key: LUCENE-2186
> URL: https://issues.apache.org/jira/browse/LUCENE-2186
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2186.patch
>
>
> I created an initial basic impl for storing "index values" (ie
> column-stride value storage).  This is still a work in progress... but
> the approach looks compelling.  I'm posting my current status/patch
> here to get feedback/iterate, etc.
> The code is standalone now, and lives under new package
> oal.index.values (plus some util changes, refactorings) -- I have yet
> to integrate into Lucene so eg you can mark that a given Field's value
> should be stored into the index values, sorting will use these values
> instead of field cache, etc.
> It handles 3 types of values:
>   * Six variants of byte[] per doc, all combinations of fixed vs
> variable length, and stored either "straight" (good for eg a
> "title" field), "deref" (good when many docs share the same value,
> but you won't do any sorting) or "sorted".
>   * Integers (variable bit precision used as necessary, ie this can
> store byte/short/int/long, and all precisions in between)
>   * Floats (4 or 8 byte precision)
> String fields are stored as the UTF8 byte[].  This patch adds a
> BytesRef, which does the same thing as flex's TermRef (we should merge them).

[jira] Commented: (LUCENE-2182) DEFAULT_ATTRIBUTE_FACTORY faills to load implementation class when iterface comes from different classloader

2009-12-28 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794830#action_12794830
 ] 

Michael Busch commented on LUCENE-2182:
---

Looks like a good solution!

Thanks for taking care of this, Uwe!

{quote}
Should we backport this to 2.9 and 3.0 (which is easy)?
{quote}

+1

> DEFAULT_ATTRIBUTE_FACTORY faills to load implementation class when iterface 
> comes from different classloader
> 
>
> Key: LUCENE-2182
> URL: https://issues.apache.org/jira/browse/LUCENE-2182
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9.1, 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2182.patch
>
>
> This is a followup for 
> [http://www.lucidimagination.com/search/document/1724fcb3712bafba/using_the_new_tokenizer_api_from_a_jar_file]:
> The DEFAULT_ATTRIBUTE_FACTORY should load the implementation class for a 
> given attribute interface from the same classloader like the attribute 
> interface. The current code loads it from the classloader of the 
> lucene-core.jar file. In solr this fails when the interface is in a JAR file 
> coming from the plugins folder. 
> The interface is loaded correctly, because the 
> addAttribute(FooAttribute.class) loads the FooAttribute.class from the plugin 
> code and this with success. But as addAttribute tries to load the class from 
> its local lucene-core.jar classloader it will not find the attribute.
> The fix is to tell Class.forName to use the classloader of the corresponding 
> interface, which is the correct way to handle it, as the impl and the 
> attribute should always be in the same classloader and file.
> I hope I can somehow add a test for that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789946#action_12789946
 ] 

Michael Busch commented on LUCENE-2126:
---

{quote}
So first, can we perhaps name them otherwise, like LuceneInput/Output or 
something similar, to not confuse w/ Java's?
{quote}

Hmm, I was a bit concerned about confusion at first too. But I'm, like Mark, 
not really liking LuceneInput/Output. I'd personally be ok with keeping 
DataInput/Output. But maybe we can come up with something better? Man, naming 
is always so hard... :)

{quote}
Second, why not have them implement Java's DataInput/Output, and add on top of 
them additional methods, like readVInt(), readVLong() etc.?
{quote}

I considered that, but Java's interfaces dictate what string encoding to use:
(From java.io.DataInput's javadocs)
{noformat}
Implementations of the DataInput and DataOutput interfaces represent Unicode 
strings in a format that is a slight modification of UTF-8.
{noformat}

E.g. DataInput defines readChar(), which we'd have to implement. But in 
IndexInput we deprecated readChars(), because we don't want to use modified 
UTF-8 anymore.
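
A small standalone demo of the encoding difference in question (nothing 
Lucene-specific): java.io.DataOutput's writeUTF() uses modified UTF-8, which 
encodes a supplementary character as a 6-byte surrogate pair, while standard 
UTF-8 needs 4 bytes:

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
  public static void main(String[] args) throws Exception {
    // a supplementary code point (outside the BMP)
    String s = new String(Character.toChars(0x1F600));

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    new DataOutputStream(bytes).writeUTF(s);

    // writeUTF() prepends a 2-byte length, so subtract it before comparing
    System.out.println("modified UTF-8: " + (bytes.size() - 2) + " bytes"); // 6
    System.out.println("standard UTF-8: "
        + s.getBytes(StandardCharsets.UTF_8).length + " bytes");            // 4
  }
}
{code}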

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789944#action_12789944
 ] 

Michael Busch commented on LUCENE-2126:
---

{quote}
What does a "normal" user do with a file?

   Step 1: Open the file.
   Step 2: Write data to the file.
   Step 3: Close the file.

Then, later...

   Step 1: Open the file.
   Step 2: Read data from the file.
   Step 3: Close the file.

You're saying that Lucene's file abstraction is easier to understand if you
break that up?
{quote}

No, I'm saying "normal" users do not work directly with files, so they won't do 
any of your steps above. They don't need to know those I/O related classes 
(except Directory).

DataInput/Output is about encoding/decoding of data, which is all a user of 
2125 needs to worry about. The user doesn't have to know that the attribute is 
first serialized into byte slices in TermsHashPerField and then written into 
the file(s) the actual codec defines.  

{quote}
But the idea that this strange fragmentation of the IO hierarchy makes things
easier - I don't get it at all. And I certainly don't see how it's such an
improvement over what exists now that it justifies a change to the public API.
{quote}

It makes it easier for a 2125 user. It does not make it harder for someone 
"advanced" who's dealing with IndexInput/Output already.

It also makes things cleaner - look e.g. at ByteSliceReader/Writer: those 
classes currently just throw RuntimeExceptions in the methods that this patch 
leaves in IndexInput/Output. Why? Because they're not dealing with file I/O, 
but with data (de)serialization.

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789834#action_12789834
 ] 

Michael Busch commented on LUCENE-2126:
---

I disagree with you here: introducing DataInput/Output actually makes the API, 
IMO, easier for the "normal" user to understand.

I would think that most users don't implement IndexInput/Output extensions, but 
simply use the out-of-the-box Directory implementations, which provide 
IndexInput/Output impls. Also, most users probably don't even call the 
IndexInput/Output APIs directly. 

{quote}
Do nothing and assume that the sort of advanced user who writes a posting
codec won't do something incredibly stupid like call indexInput.close().
{quote}

Writing a posting codec is much more advanced compared to using 2125's features. 
Ideally, a user who simply wants to store some specific information in the 
posting list, such as a boost, a part-of-speech identifier, another VInt, etc. 
should with 2125 only have to implement a new attribute including the 
serialize()/deserialize() methods. People who want to do that don't need to 
know anything about Lucene's API layer. They only need to know the APIs that 
DataInput/Output provide and will not get confused with methods like seek() or 
close(). For the standard user who only wants to write such an attribute it 
should not matter what Lucene's I/O structure looks like - so even if we make 
changes that go in Lucy's direction in the future (IndexInput/Output owning a 
filehandle vs. the need to extend them) the serialize()/deserialize() methods 
of the attribute would still work with DataInput/Output.

I bet that a lot of people who used the payload feature before took a 
ByteArrayOutputStream together with a DataOutputStream (which implements Java's 
DataOutput) to populate the payload byte array. With 2125 Lucene will provide 
an API that is similarly easy to use, but more efficient, as it removes the 
byte[] indirection and overhead.
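
The kind of code that paragraph alludes to might look like this; PayloadBytes 
and its fields are made up for illustration, and the resulting byte[] is what 
would have been handed to Lucene as the payload:

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PayloadBytes {
  // build the byte[] that would then be handed to Lucene as a payload
  static byte[] encode(int partOfSpeechId, float boost) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeByte(partOfSpeechId);
    out.writeFloat(boost);
    out.flush();
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(encode(7, 1.5f).length + " payload bytes");  // 5
  }
}
{code}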

I'm still +1 for this change. Others?

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789834#action_12789834
 ] 

Michael Busch edited comment on LUCENE-2126 at 12/13/09 1:22 AM:
-

I disagree with you here: introducing DataInput/Output actually makes the API, 
IMO, easier for the "normal" user to understand.

I would think that most users don't implement IndexInput/Output extensions, but 
simply use the out-of-the-box Directory implementations, which provide 
IndexInput/Output impls. Also, most users probably don't even call the 
IndexInput/Output APIs directly. 

{quote}
Do nothing and assume that the sort of advanced user who writes a posting
codec won't do something incredibly stupid like call indexInput.close().
{quote}

Writing a posting codec is much more advanced compared to using 2125's features. 
Ideally, a user who simply wants to store some specific information in the 
posting list, such as a boost, a part-of-speech identifier, another VInt, etc. 
should with 2125 only have to implement a new attribute including the 
serialize()/deserialize() methods. People who want to do that don't need to 
know anything about Lucene's API layer. They only need to know the APIs that 
DataInput/Output provide and will not get confused with methods like seek() or 
close(). For the standard user who only wants to write such an attribute it 
should not matter what Lucene's I/O structure looks like - so even if we make 
changes that go in Lucy's direction in the future (IndexInput/Output owning a 
filehandle vs. the need to extend them) the serialize()/deserialize() methods 
of the attribute would still work with DataInput/Output.

I bet that a lot of people who used the payload feature before took a 
ByteArrayOutputStream together with a DataOutputStream (which implements Java's 
DataOutput) to populate the payload byte array. With 2125 Lucene will provide 
an API that is similarly easy to use, but more efficient, as it removes the 
byte[] indirection and overhead.

I'm still +1 for this change. Others?

  was (Author: michaelbusch):
I disagree with you here: introducing DataInput/Output makes IMO the API 
actually easier for the "normal" user to understand.

I would think that most users don't implement IndexInput/Output extensions, but 
simply use the out-of-the-box Directory implementations, which provide 
IndexInput/Output impls. Also, most users probably don't even call the 
IndexInput/Output APIs directly. 

{quote}
Do nothing and assume that the sort of advanced user who writes a posting
codec won't do something incredibly stupid like call indexInput.close().
{quote}

Writing a posting code is much more advanced compared to using 2125's features. 
Ideally, a user who simply wants to store some specific information in the 
posting list, such as a boost, a part-of-speech identifier, another VInt, etc. 
should with 2125 only have to implement a new attribute including the 
serialize()/deserialize() methods. People who want to do that don't need to 
know anything about Lucene's API layer. They only need to know the APIs that 
DataInput/Output provide and will not get confused with methods like seek() or 
close(). For the standard user who only wants to write such an attribute it 
should not matter how Lucene's IO structure looks like - so even if we make 
changes that go into Lucy's direction in the future (IndexInput/Output owning a 
filehandling vs. the need to extend them) the serialize()/deserialize() methods 
of attribute would still work with DataInput/Output.

I bet that a lot of people who used the payload feature before took a 
ByteArrayOutputStream together with DataOutputStream (which implements Java's 
DataOutput) to populate the payload byte array. With 2125 Lucene will provide 
an API that is similar to use, but more efficient as it remove the byte[] array 
indirection and overhead.

I'm still +1 for this change. Others?
  
> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend DataInput/DataOutput.

[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-09 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788001#action_12788001
 ] 

Michael Busch commented on LUCENE-2126:
---

The main reason why I'd like to separate DataInput/Output from 
IndexInput/Output now is LUCENE-2125. Users should be able to implement methods 
that serialize/deserialize attributes into/from a posting list. These methods 
should only be able to call the read/write methods (which this issue moves to 
DataInput/Output), but not methods like close(), seek(), etc.
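
As a rough sketch of the shape this gives user code, here is a hypothetical 
attribute using only data-level read/write methods. IndexableAttribute and 
PartOfSpeechAttribute are made-up stand-ins, and java.io.DataInput/DataOutput 
are used only to keep the snippet standalone (the real hooks would take 
Lucene's DataInput/DataOutput):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Stand-in for the proposed per-attribute hooks; only data-level read/write
// methods are visible here, nothing like seek() or close().
interface IndexableAttribute {
  boolean storeInIndex();
  void serialize(DataOutput out) throws IOException;
  void deserialize(DataInput in) throws IOException;
}

class PartOfSpeechAttribute implements IndexableAttribute {
  private int posTag;

  void setPosTag(int posTag) { this.posTag = posTag; }
  int getPosTag() { return posTag; }

  @Override
  public boolean storeInIndex() { return true; }

  @Override
  public void serialize(DataOutput out) throws IOException {
    out.writeByte(posTag);             // one byte per position in the posting list
  }

  @Override
  public void deserialize(DataInput in) throws IOException {
    posTag = in.readUnsignedByte();
  }
}
{code}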

Thanks for spending time reviewing this and giving feedback from Lucy land, 
Marvin!
I think I will go ahead and commit this, and once we see a need to allow users 
to extend DataInput/Output outside of Lucene we can go ahead and make the 
additional changes that are mentioned in your and my comments here.

So I will commit this tomorrow if nobody objects.

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-07 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787180#action_12787180
 ] 

Michael Busch commented on LUCENE-2126:
---

Thanks for the input, Marvin.

I can see the advantages of what you're proposing. With this patch, the only 
way for all IndexInput/IndexOutput implementations to benefit from a new 
encoding/decoding method is to add it to the DataInput/Output classes directly, 
which is only possible by changing the classes in Lucene, not from outside.

The problem here, as so often, is backwards-compat. This patch here has no 
problems in that regard, as we just move the methods into new superclasses. If 
we'd want to implement what Lucy is doing, we'd have to deprecate all 
encoding/decoding methods in IndexInput/Output and add them to 
DataInput/Output. Then a DataInput would not be the superclass of IndexInput, 
but rather *have* an IndexInput. All users who call any of the 
encoding/decoding methods directly on IndexInput/Output would have to change 
their code to use the new classes. 

So I can certainly see the benefits; the question now is whether they're 
currently important enough to justify dealing with the backwards-compat hassle.

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2125) Ability to store and retrieve attributes in the inverted index

2009-12-07 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786855#action_12786855
 ] 

Michael Busch commented on LUCENE-2125:
---

{quote}
BTW probably the attribute should include a "merge" operation, somehow, to be 
efficient (simply byte[] copying instead of decode/encode) in the merge case.
{quote}

Yes, and then I can also close LUCENE-1585! :)
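
Purely as a strawman for what such a merge operation might look like (nothing 
below exists; DataInput/DataOutput are the classes proposed in LUCENE-2126, 
and how the length would be known is exactly the open question):

{code:java}
import java.io.IOException;

import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

// Strawman only: lets a stored attribute be carried from one segment to
// another as raw bytes during merging, skipping the deserialize()/serialize()
// round trip.
interface IndexedAttributeMerger {
  void copyRaw(DataInput from, DataOutput to, int lengthInBytes) throws IOException;
}
{code}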

> Ability to store and retrieve attributes in the inverted index
> --
>
> Key: LUCENE-2125
> URL: https://issues.apache.org/jira/browse/LUCENE-2125
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Flex Branch
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of a
> TokenAttributeImpl in case its storeInIndex() returns true.
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in the super classes.
> Currently the payload concept is hardcoded in 
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the 
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence 
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value. 
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more 
> efficient to e.g. define an attribute that writes and reads a single 
> VInt than to store that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other 
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
> After this part is done I'd like to use a very similar approach for
> column-stride fields.
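
As a rough sketch of the ease-of-use argument, a user-defined attribute 
storing a single VInt could end up looking something like this. 
TokenAttributeImpl and the three hooks are the proposal above, not existing 
classes; the attribute itself is made up; DataInput/DataOutput are the 
classes from LUCENE-2126:

{code:java}
import java.io.IOException;

import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

// Hypothetical user attribute built on the proposed API. Everything the user
// needs (encoding, decoding, the "store me in the index" flag) lives in one
// place. The matching ConfidenceAttribute interface and the usual
// AttributeImpl methods (clear(), copyTo(), ...) are omitted for brevity.
public class ConfidenceAttributeImpl extends TokenAttributeImpl {

  private int confidence;

  public void setConfidence(int confidence) {
    this.confidence = confidence;
  }

  public int getConfidence() {
    return confidence;
  }

  @Override
  public boolean storeInIndex() {
    return true;  // tells the indexer to call serialize() for this attribute
  }

  @Override
  public void serialize(DataOutput out) throws IOException {
    out.writeVInt(confidence);  // written straight into the posting list
  }

  @Override
  public void deserialize(DataInput in) throws IOException {
    confidence = in.readVInt(); // read back without the byte[] indirection
  }
}
{code}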

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


