date:20100315


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745
 ] 

Michael Busch commented on LUCENE-2312:
---

The tricky part is to make sure that a reader always sees a consistent snapshot 
of the index.  At the same time a reader must not follow pointers to 
non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs in certain 
intervals to not prevent JVM optimizations - but I need more time for thinking 
about all the combinations and corner cases.

It's getting late now - need to sleep!

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845745#action_12845745
 ] 

Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:51 AM:


The tricky part is to make sure that a reader always sees a consistent snapshot 
of the index.  At the same time a reader must not follow pointers to 
non-published locations (e.g. array blocks).

I think I have a lock-free solution working, which only syncs (i.e. does 
volatile writes) in certain intervals to not prevent JVM optimizations - but I 
need more time for thinking about all the combinations and corner cases.

It's getting late now - need to sleep!


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845735#action_12845735
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

{quote}This makes the reference to the array volatile, not the
slots in the array{quote}

That's no good! :)

{quote}If you use a RW lock then the writer thread will block
all reader threads while it's making changes{quote}

We probably need to implement more fine-grained locking, perhaps
using volatile booleans instead of RW locks. Fine-grained
meaning at the byte array/block level. I think this would imply
that changes are not visible until a given byte block is more or
less "flushed"? This is different from the design that's been
implied, in which we'd read from byte arrays as they're being
written to. We probably don't need to read from and write to the
same byte array concurrently (that might not be feasible?).

The performance win here is probably going to be the fact that
we avoid segment merges.  



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?


[jira] Issue Comment Edited: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845731#action_12845731
 ] 

Michael Busch edited comment on LUCENE-2312 at 3/16/10 6:12 AM:


{quote}
Do volatile byte arrays work
{quote}

I'm not sure what you mean by volatile byte arrays?

Do you mean this?
{code}
volatile byte[] array;
{code}

This makes the *reference* to the array volatile, not the slots in the array.
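
To make the distinction concrete, here is a minimal sketch. The class and method names are hypothetical and not taken from Lucene; it only contrasts a volatile array reference with an AtomicIntegerArray, whose individual elements do have volatile read/write semantics.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical illustration; not Lucene code.
class VolatileArrayDemo {
  // Only the reference is volatile. Assigning a new array to 'bytes' is a
  // volatile write, but 'bytes[i] = x' is a plain write to the slot
  // (it merely performs a volatile *read* of the reference first).
  volatile byte[] bytes = new byte[16];

  // Each element of an AtomicIntegerArray is read and written with
  // volatile semantics.
  final AtomicIntegerArray ints = new AtomicIntegerArray(16);

  void writeSlot(int i, byte value) {
    bytes[i] = value; // plain element write: no visibility guarantee by itself
  }

  void replaceArray(byte[] fresh) {
    bytes = fresh; // volatile write: publishes 'fresh' and its earlier contents
  }

  void writeAtomicSlot(int i, int value) {
    ints.set(i, value); // volatile-style write to element i
  }
}
```

So a reader is only guaranteed to see slot contents that were written before the reference itself was (re)assigned, or that were published by some other happens-before edge.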


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845729#action_12845729
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

{quote}but my goal here is to implement a non-blocking and
lock-free algorithm. So my idea was to make use of a very
subtle behavior of volatile variables. {quote}

You're talking about having a per-thread write buffer byte
array that, on search, gets copied into a read-only array, or
gets transformed magically into a volatile byte array? (Do
volatile byte arrays work? I couldn't find a clear answer on the
net; maybe it's stated in the Goetz book.) If volatile byte
arrays do work, an option to test would be a byte buffer pool
that uses volatile byte arrays?


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845726#action_12845726
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
A quick and easy way to solve this is to use a read write lock
on the byte pool?
{quote}

If you use a RW lock then the writer thread will block all reader threads while 
it's making changes.  The writer thread will be making changes all the time in 
a real-time search environment.  The contention will kill performance, I'm sure. 
A RW lock is only faster than a mutual exclusion lock if writes are infrequent, 
as mentioned in the javadocs of ReadWriteLock.java
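
For reference, a minimal sketch of what the RW-lock variant under discussion could look like. LockedBytePool is a hypothetical class, not Lucene's byte pool; it only shows where the contention would come from: every append takes the exclusive write lock, so all readers stall for each written byte.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical fixed-size pool guarded by a read/write lock; not Lucene code.
class LockedBytePool {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final byte[] pool;
  private int upto; // number of bytes written so far

  LockedBytePool(int capacity) {
    pool = new byte[capacity];
  }

  // Writer: holds the exclusive lock, which blocks every reader.
  void append(byte b) {
    lock.writeLock().lock();
    try {
      pool[upto++] = b;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Readers: may run concurrently with each other, but never with a writer.
  // Returns the unsigned byte at pos, or -1 if pos has not been written yet.
  int read(int pos) {
    lock.readLock().lock();
    try {
      return pos < upto ? (pool[pos] & 0xFF) : -1;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```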


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845721#action_12845721
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Just to clarify, I think Mike's referring to ParallelArray?

http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.html

There's AtomicIntegerArray:
http://www.melclub.net/java/_atomic_integer_array_8java_source.html 
which underneath uses the sun.misc.Unsafe class for volatile array
access. Could this be reused for an AtomicByteArray class (why
isn't there one of these already?).
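
One way such a class might be approximated without touching Unsafe directly is to pack four bytes into each int of an AtomicIntegerArray. The following is only a sketch under that assumption: AtomicByteArray is a hypothetical name (the JDK ships no such class), and the compare-and-set loop is needed so that a write to one byte does not clobber a concurrent write to a neighbouring byte in the same int.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical class: four bytes are packed into each int of the backing array.
class AtomicByteArray {
  private final AtomicIntegerArray words;
  private final int length;

  AtomicByteArray(int length) {
    this.length = length;
    this.words = new AtomicIntegerArray((length + 3) >>> 2);
  }

  int length() {
    return length;
  }

  // Volatile read of the int holding byte i, then extract that byte.
  byte get(int i) {
    checkIndex(i);
    int shift = (i & 3) << 3;
    return (byte) (words.get(i >>> 2) >>> shift);
  }

  // CAS loop: replace only the 8 bits that belong to byte i.
  void set(int i, byte value) {
    checkIndex(i);
    int word = i >>> 2;
    int shift = (i & 3) << 3;
    int mask = 0xFF << shift;
    int bits = (value & 0xFF) << shift;
    while (true) {
      int old = words.get(word);
      int updated = (old & ~mask) | bits;
      if (words.compareAndSet(word, old, updated)) {
        return;
      }
    }
  }

  private void checkIndex(int i) {
    if (i < 0 || i >= length) {
      throw new IndexOutOfBoundsException("index " + i);
    }
  }
}
```

Note that every byte write becomes a CAS here, which is likely far too expensive for a hot indexing path; the sketch only shows that per-element volatile semantics are possible.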

A quick and easy way to solve this is to use a read write lock
on the byte pool? Remember when we'd sync on each read bytes
call to the underlying random access file in FSDirectory? (Now
we're using NIOFSDir, which can be a good concurrent
throughput improvement.) Let's try the RW lock and examine the
results? I guess the issue is we're not writing in blocks of
bytes, we're actually writing byte by byte and need to read byte
by byte concurrently? This sounds like a fairly typical thing to
do? 



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845712#action_12845712
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote} Hmm... what does JMM say about byte arrays? If one thread is writing
to the byte array, can any other thread see those changes? 
{quote}

This is exactly the right question to ask here. Thread-safety is by far the
most complicated aspect of this feature. Jason, I'm not sure whether you have
already figured out how to ensure visibility of changes made by the writer
thread to the reader threads?

Thread-safety in our case boils down to safe publication. We don't need
locking to coordinate writes by multiple threads, because of LUCENE-2324. But
we need to make sure that the reader threads see all changes they need to see
at the right time, in the right order. This is IMO very hard, but we all like
challenges :)

The JMM gives no guarantee whatsoever about which changes made by one thread
another thread will see - or whether it will ever see them - unless proper
publication is ensured by either synchronization or volatile/atomic variables.

So e.g. if a writer thread executes the following statements:
{code}
public static int a, b;

...

a = 1; b = 2;

a = 5; b = 6;
{code}

and a reader thread does:
{code}
System.out.println(a + "," + b);
{code}

The thing to remember is that the output might be: 1,6! Another reader thread
with the following code: 
{code}
while (b != 6) {
  // ... do something
}
{code}
might furthermore NEVER terminate without synchronization/volatile/atomic.

The reason is that the JVM is allowed to perform any reorderings to utilize
modern CPUs, memory, caches, etc. if not forced otherwise.
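
As a small illustration of the second case, declaring b volatile is one way to guarantee that the spinning reader eventually observes the write. The class below is a hypothetical, self-contained sketch of that fix, not code from the prototype.

```java
// Hypothetical demo: a reader spins on a volatile field until the writer's
// update becomes visible.
class VolatileFlagDemo {
  // volatile: the reader's loop must re-read b and will see the write below.
  static volatile int b;

  // Returns true if the reader thread observed b == 6 and terminated.
  static boolean run() {
    b = 0;
    Thread reader = new Thread(() -> {
      while (b != 6) {
        // spin until the volatile write becomes visible
      }
    });
    reader.start();
    b = 6; // volatile write
    try {
      reader.join(5000); // give the reader up to five seconds to finish
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !reader.isAlive();
  }
}
```

Without the volatile modifier the JIT compiler would be allowed to hoist the read of b out of the loop, which is how the non-terminating behavior can arise in practice.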

To ensure safe publication of data written by a thread we could use
synchronization, but my goal here is to implement a non-blocking and
lock-free algorithm. So my idea was to make use of a very subtle behavior
of volatile variables. I will borrow a simple explanation of the JMM from Brian
Goetz's awesome book "Java Concurrency in Practice", in which he describes the
JMM in terms of simple happens-before rules. I will mention only three of those
rules, because they are enough to describe the volatile behavior I'd like to
discuss here (p. 341):

*Program order rule:* Each action in a thread _happens-before_ every action in
that thread that comes later in the program order.

*Volatile variable rule:* A write to a volatile field _happens-before_ every
subsequent read of that same field.

*Transitivity:* If A _happens-before_ B, and B _happens-before_ C, then A
_happens-before_ C.

Based on these three rules you can see that writing to a volatile variable v
by one thread t1 and subsequent reading of the same volatile variable v by
another thread t2 publishes ALL changes of t1 that happened-before the write
to v and the change of v itself. So this write/read of v means crossing a
memory barrier and forcing everything that t1 might have written to caches to
be flushed to the RAM. That's why a volatile write can actually be pretty
expensive.
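
A minimal sketch of this "piggybacking" on a volatile write, under the assumption of a single writer thread. The class and field names are hypothetical and the int[] stands in for whatever the writer fills: the writer does many plain writes and only occasionally performs one volatile write, and a reader that first reads the volatile field is guaranteed to see every slot below the published limit.

```java
// Hypothetical sketch; assumes exactly one writer thread and any number of readers.
class PiggybackPublisher {
  private final int[] postings = new int[1024]; // plain, non-volatile slots
  private volatile int published;               // number of slots readers may look at
  private int written;                          // touched by the writer thread only

  // Writer: plain write, cheap, not yet guaranteed to be visible to readers.
  void add(int posting) {
    postings[written++] = posting;
  }

  // Writer: one volatile write publishes all plain writes that preceded it
  // (program order rule + volatile variable rule + transitivity).
  void publish() {
    published = written;
  }

  // Reader: the volatile read of 'published' makes slots [0, limit) visible.
  int sum() {
    int limit = published;
    int sum = 0;
    for (int i = 0; i < limit; i++) {
      sum += postings[i];
    }
    return sum;
  }
}
```

Calling publish() only every N additions is one way to read "syncing in certain intervals": readers lag slightly behind the writer but never see a slot that was not published.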

Note that this behavior has only worked the way I just described since
Java 1.5. The behavior of volatile variables changed very, very subtly from
1.4->1.5!

The way I'm trying to make use of this behavior is actually similar to how we
lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM
write-through as much as possible, which increases the probability of getting
the sync for free! Still fleshing out the details, but I wanted to share this
information with you guys already, because it might invalidate a lot of
assumptions you might have when developing the code. Some of this stuff was
actually new to me, maybe you all know it already.  And if anything that I
wrote here is incorrect, please let me know!

Btw: IMO, if there's only one Java book you can ever read, then read Goetz's
book! It's great. Somewhere in the book he also says about lock-free
algorithms: "Don't try this at home!" - so, let's do it! :)


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845703#action_12845703
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Sounds like awesome progress!! Want some details over here :)
{quote}

Sorry for not being very specific.  The prototype I'm experimenting with has a 
fixed length postings format for the in-memory representation (in TermsHash).  
Basically every posting has 4 bytes, so I can use int[] arrays (instead of the 
byte[] pools).  The first 3 bytes are used for an absolute docID (not 
delta-encoded). This limits the max in-memory segment size to 2^24 docs.  The 1 
remaining byte is used for the position.  With a max doc length of 140 
characters you can fit every possible position in a byte - what a luxury! :)  
If a term occurs multiple times in the same doc, then the TermDocs just skips 
multiple occurrences with the same docID and increments the freq.  Again, the 
same term doesn't occur often in super short docs.
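
The 4-byte layout described above can be sketched as simple bit packing. This is only an illustration: the class name is hypothetical, and placing the docID in the upper 24 bits and the position in the lowest byte is an assumption consistent with "the first 3 bytes" but not confirmed by the prototype.

```java
// Hypothetical encoding of one posting into a single int:
// upper 24 bits = absolute docID, lowest 8 bits = position.
class FixedPosting {
  static final int MAX_DOC_ID = (1 << 24) - 1; // 2^24 docs per in-memory segment
  static final int MAX_POSITION = 0xFF;        // positions 0..255 fit in one byte

  static int encode(int docID, int position) {
    if (docID < 0 || docID > MAX_DOC_ID) {
      throw new IllegalArgumentException("docID out of range: " + docID);
    }
    if (position < 0 || position > MAX_POSITION) {
      throw new IllegalArgumentException("position out of range: " + position);
    }
    return (docID << 8) | position;
  }

  static int docID(int posting) {
    return posting >>> 8; // unsigned shift: large docIDs set the sign bit
  }

  static int position(int posting) {
    return posting & 0xFF;
  }
}
```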

Unlike in Lucene's TermsHash, the int[] slices don't have forward pointers 
but backwards pointers.  In real-time search you often want a strongly 
time-biased ranking.  A PostingList object has a pointer that points to the 
last posting (this statement is not 100% correct for visibility reasons across 
threads, but we can imagine it this way for now).  A TermDocs can now traverse 
the posting lists in opposite order.  Skipping can be done by following pointers 
to previous slices directly, or by binary search within a slice.
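
A simplified sketch of the reverse traversal, including the freq counting for repeated docIDs. It deliberately ignores slices and backward pointers and assumes that one term's postings sit in a single flat int[] with the docID in the upper 24 bits; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified traversal: one flat int[] instead of linked slices.
class ReversePostingsDemo {
  // Walks postings[last..0] and returns {docID, freq} pairs, newest doc first.
  // Consecutive postings with the same docID are collapsed into one entry.
  static List<int[]> docsAndFreqs(int[] postings, int last) {
    List<int[]> result = new ArrayList<>();
    int i = last;
    while (i >= 0) {
      int docID = postings[i] >>> 8; // upper 24 bits hold the docID
      int freq = 0;
      while (i >= 0 && (postings[i] >>> 8) == docID) {
        freq++;
        i--;
      }
      result.add(new int[] {docID, freq});
    }
    return result;
  }
}
```

Here the "last" argument plays the role of the PostingList's pointer to the most recent posting; a reader that captured an older value of it would simply start further back.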


[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-15 Thread Shai Erera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845702#action_12845702
 ] 

Shai Erera commented on LUCENE-2310:


I like the idea of Document implementing Iterable, but how does that solve the 
case where someone wants to query how many fields a document has? Will you 
still have getFields(), only now it will return an unmodifiable collection?

I guess the unmod collection can be returned even today, right?

BTW, what happens if getFields() returns an unmod collection, but someone calls 
doc.add(Field)? I think the unmod collection prevents you from adding to that 
collection wrapper, but does not prevent the underlying collection from being 
changed under the hood? If that's true, then that could cause some trouble ... 
so getFields() would really have to return a snapshot of Document, which means 
we need to clone Fields ...

Gets too complicated, no? Maybe just do: (1) Doc implements Iterable and (2) Doc 
exposes numFields(), add(Field)?

About remove(field), I thought of a possible scenario, though I still don't 
think it's interesting enough - suppose that you pass your Document through a 
processing pipeline/chain, where each handler adds fields as metadata to the 
Document. For example, annotators. It might be that a field A exists, only for 
a handler down the chain to understand A's meaning and then replace it w/ A1 
and A2. For that you'll want to be able to remove a field ... I guess we could 
add a remove method to Document, and if it is called while the fields are 
being iterated over, a CME will be thrown, which is perfectly fine with me.
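
The view-versus-snapshot concern can be checked with plain java.util collections. The sketch below uses a hypothetical stand-in class (Doc, with Strings instead of Field objects); it is not the Lucene Document API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Document; fields are plain Strings here.
class Doc implements Iterable<String> {
  private final List<String> fields = new ArrayList<>();

  void add(String field) {
    fields.add(field);
  }

  boolean remove(String field) {
    return fields.remove(field);
  }

  int numFields() {
    return fields.size();
  }

  // A read-only *view*, not a snapshot: callers cannot add through it,
  // but it still reflects later add()/remove() calls on the Doc.
  List<String> getFields() {
    return Collections.unmodifiableList(fields);
  }

  @Override
  public Iterator<String> iterator() {
    return getFields().iterator();
  }
}
```

With this shape, view.add(...) throws UnsupportedOperationException, while a doc.add(...) made after getFields() was called shows up in the previously returned list. Removing a field while iterating will typically raise a ConcurrentModificationException on the next call to next(), although ArrayList's fail-fast check is best-effort only.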

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
> Issue Type: Sub-task
> Components: Index
> Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch, 
> LUCENE-2310-Deprecate-AbstractField.patch, 
> LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field-type-like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845696#action_12845696
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Payloads work (non-lazy loading); however, ByteSliceReader doesn't implement a 
seek method, so I think we simply need to load each payload as we increment 
nextPosition?  The cost shouldn't be too much because we're simply copying 
small byte arrays (in the heap).

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer user's near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Todays Lucene based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


Re: lucene and solr trunk

2010-03-15 Thread Chris Hostetter

: prime-time as the new solr trunk!  Lucene and Solr need to move to a
: common trunk for a host of reasons, including single patches that can
: cover both, shared tags and branches, and shared test code w/o a test
: jar.

Without a clearer picture of how people envision development "overhead" 
working as we move forward, it's really hard to understand how any of 
these ideas make sense...
  1) how should the automated build process(es) work?
  2) how are we going to do branching/tagging for releases?  particularly 
in situations where one product is ready for a release and the other isn't?
  3) how are we going to deal with minor bug fix release tagging?
  4) should it be possible for people to check out Lucene-Java w/o 
checking out Solr?

(i suspect a whole lot of people who only care about the core library are 
going to really adamantly not want to have to check out all of Solr just 
to work on the core)

: Both projects move to a new trunk:
:   /something/trunk/java, /something/trunk/solr

my gut says something like this will make the most sense, assuming 
"/something/trunk" == "/java/trunk" and "java" actually means "core" ... 
ie: this discussion should really be part and parcel with how contribs 
should be reorged.



-Hoss



Re: lucene and solr trunk

2010-03-15 Thread Robert Muir

On Mon, Mar 15, 2010 at 11:41 PM, Mark Miller  wrote:
>>
>> Solr moves to Lucene's trunk:
>>   /java/trunk, /java/trunk/sol
>
> +1. With the goal of merged dev, merged tests, this looks the best to me.
> Simple to do patches that span both, simple to setup
> Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it.
>

+1

-- 
Robert Muir
rcm...@gmail.com


Re: lucene and solr trunk

2010-03-15 Thread Mark Miller


On 03/15/2010 11:28 PM, Yonik Seeley wrote:
> So, we have a few options on where to put Solr's new trunk:
>
> Solr moves to Lucene's trunk:
>    /java/trunk, /java/trunk/sol

+1. With the goal of merged dev, merged tests, this looks the best to 
me. Simple to do patches that span both, simple to setup
Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it.

--
- Mark

http://www.lucidimagination.com





lucene and solr trunk

2010-03-15 Thread Yonik Seeley

Due to a tremendous amount of work by our newly merged committer
corps, the get-on-lucene-trunk branch (branches/solr) is ready for
prime-time as the new solr trunk!  Lucene and Solr need to move to a
common trunk for a host of reasons, including single patches that can
cover both, shared tags and branches, and shared test code w/o a test
jar.

The current Lucene trunk is: .../lucene/java/trunk
The current Solr trunk is: .../lucene/solr/trunk

So, we have a few options on where to put Solr's new trunk:

Lucene moves to Solr's trunk:
  /solr/trunk, /solr/trunk/lucene

Solr moves to Lucene's trunk:
  /java/trunk, /java/trunk/solr

Both projects move to a new trunk:
  /something/trunk/java, /something/trunk/solr

-Yonik


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845680#action_12845680
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

In thinking about the terms dictionary, we're going to run into concurrency 
issues if we just use TreeMap, right?  Can't we simply use the lock-free 
ConcurrentSkipListMap?  Yes, it's part of Java 6, but why reinvent the 
wheel?
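
To make the suggestion concrete, here is a minimal sketch of a sorted terms dictionary backed by java.util.concurrent.ConcurrentSkipListMap. The class and method names (SortedTermsSketch, addTerm, seekCeil) are hypothetical and are not taken from any patch on this issue; the sketch also assumes a single indexing thread assigns term IDs while search threads only read.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: a sorted term -> termID map that search threads can
// read while one indexing thread keeps adding terms.
final class SortedTermsSketch {
    private final ConcurrentNavigableMap<String, Integer> terms =
        new ConcurrentSkipListMap<String, Integer>();
    private int nextTermId = 0; // assumption: touched only by the indexing thread

    // Indexing thread: returns the existing ID if the term is already known.
    int addTerm(String term) {
        Integer existing = terms.get(term);
        if (existing != null) {
            return existing;
        }
        int id = nextTermId++;
        terms.put(term, id);
        return id;
    }

    // Search threads: smallest term >= target in sorted order, or null.
    String seekCeil(String target) {
        return terms.ceilingKey(target);
    }

    // Sorted, weakly consistent view; iteration never throws
    // ConcurrentModificationException while terms are being added.
    Iterable<String> sortedTerms() {
        return terms.keySet();
    }
}
```

The skip list keeps keys ordered at all times, so no separate sort step is needed before a query runs. The trade-off is more per-entry object overhead than the hash table IW uses today.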


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845679#action_12845679
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Basic term positions working, need to figure out how to do lazy loading 
payloads...


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845663#action_12845663
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

I have a test case showing the term docs working... I'm going to try to add the 
term positions methods.


Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-15 Thread Marvin Humphrey

On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
> I mean specifically one should not have to commit to the precise
> scoring model they will use for a given field, when they index that
> field.

Yeah, I've never seen committing to a precise scoring model at index-time via
Sim choice as a big deal.  In Lucy, per-field Similarity assignments are part
of the Schema, which has to be set at index-time.  And index-time Sim
choice is the way things have always been done in Lucene.

In any case, the proposal to start delaying Sim choice to search-time -- while
a nice feature for Lucene -- is a non-starter for Lucy.   We can't do that
because it would kill the cheap-Searcher model to generate boost bytes at
Searcher construction time and cache them within the object.  We need those
boost bytes written to disk so we can mmap them and share them amongst many
cheap Searchers.

So... you're proposing shrinking Similarity's public API by removing
functionality that Lucy can't live without.  If indeed that works out for
Lucene, the role of Similarity within the two libraries will have to diverge.
In Lucene, Similarity will get smaller; in Lucy it will expand a bit.

To my mind, these are all related data reduction tasks:

  * Omit doc-boost and field-boost, replacing them with a single float
    docXfield multiplier -- because you never need doc-boost on its own.
  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
    replacing them all with a single boost byte -- because for the kind of
    scoring you want to do, you don't need all those raw stats.
  * Omit the boost byte, because you don't need to do scoring at all.
  * Omit positions because you don't need PhraseQueries, etc. to match.
  * Omit everything except doc-id, because you only need binary matching.

What all those tasks have in common is that we can determine what stats are
disposable based on how the user describes how they are going to use the
field.

For Lucy, the user is going to have to commit to a "precise scoring model" at
index-time by specifying a Sim choice anyway.  If that Sim turns out to be a
MatchSimilarity, why on earth should we keep around the boost bytes?

> > And what class other than Similarity knows enough about the scoring
> > algorithm to perform these data reduction tasks?  If it's not going to be
> > Similarity itself, it has to be something that knows absolutely everything
> > about the Similarity implementation's scoring model.
> 
> I don't follow this...
> 
> It will be Sim that computes norm bytes.

I meant that if you're writing out boost bytes, there's no sensible way to
execute the lossy data reduction and reduce the index size other than having
Sim do it.  

> >  class MySim extends Similarity {
> >    public PostingCodec makePostingCodec() {
> >      StandardPostingCodec codec = new StandardPostingCodec();
> >      codec.setOmitBoostBytes(true);
> >      codec.setOmitPositions(true);
> >      return (PostingCodec)codec;
> >    }
> >  }
> 
> This still feels like you are mixing two very different concepts --
> what's being written (boost bytes, positions, docTermFreqs) vs how it's
> encoded (codec).  

So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()?
Maybe that's right.  Guess I'll watch to see how flex pans out and what
methods you put on those PostingCodec classes.

For now, I just want to make the no-boost-bytes and doc-id-only index
optimizations available, and to achieve that, it's sufficient to implement
format-follows-sim and publish MatchSimilarity and MinimalSimilarity.  The
PostingCodec API can remain a private implementation detail until a later
date.

> Shouldn't Lucy's schema record what stats should be indexed for the field?  

No, it shouldn't -- not directly.  

You tell the Schema how you want the field to be used.  That information is
used to derive what stats are needed, and whether the ones that are needed can
be combined, compressed, etc.
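
As a rough illustration of deriving the stored stats from declared usage, the sketch below maps a usage description to the set of stats that must be written. It is written in Java for consistency with the rest of this list; the names FieldUsage, Stat and StatsPlanner are hypothetical and do not correspond to any Lucy or Lucene API.

```java
import java.util.EnumSet;

// Hypothetical stats an index could store per posting.
enum Stat { DOC_ID, POSITIONS, BOOST_BYTE }

// Hypothetical ways a user might declare that a field will be used.
enum FieldUsage { MATCH_ONLY, PHRASE_MATCH_NO_SCORING, FULL_SCORING }

final class StatsPlanner {
    // Derive the stats that must be written from the declared usage;
    // everything not returned here is disposable for that field.
    static EnumSet<Stat> statsFor(FieldUsage usage) {
        switch (usage) {
            case MATCH_ONLY:
                // Binary matching only needs doc ids.
                return EnumSet.of(Stat.DOC_ID);
            case PHRASE_MATCH_NO_SCORING:
                // Phrase matching needs positions but no boost byte.
                return EnumSet.of(Stat.DOC_ID, Stat.POSITIONS);
            default:
                // Full scoring keeps everything.
                return EnumSet.allOf(Stat.class);
        }
    }
}
```

The point of the indirection is that the user never names a stat directly: they state intent, and the library decides what can be dropped.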

> Then, any codec you swap in should respect that?  EG maybe I use PForCodec
> instead, or a PulsingCodec(PForCodec)?

I guess.  I don't see publishing a PForCodec with an elaborate API as being
very important, though.  It's more important to just use PFOR internally when
it's the best choice.

> I'm thinking the various Sim classes, which you'd select during
> searching, will note in jdocs what attrs must be indexed.  It's your
> job to read that and set your field (schema) up accordingly, ie,
> enable those required attrs. 

Yeah, that'll at least get the job done for Lucene.  

I don't think it's ideal to force people to understand that stuff, but hey,
the more people are confused, the more important it is for them to buy
optimization seminars where Lucene gurus explain all the obscure incantations
to them.  :)

> > You seem to be fixated on the notion of swapping in a MatchOnlySim object at
> > search time.  You can't do that in KS/Lucy, because you can't modify a 
> > Schema
> > at

[jira] Updated: (LUCENE-2098) make BaseCharFilter more efficient in performance

2010-03-15 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2098:


Attachment: LUCENE-2098.patch

I haven't benchmarked this to see if it is any faster; it may even be slower.

But it's no longer a linear algorithm.
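
For readers who have not opened the patch, one way to avoid a linear scan is to keep the correction entries in parallel arrays and binary-search them. The sketch below is an assumption about the general technique, not a copy of LUCENE-2098.patch; the class name OffsetCorrectionSketch is hypothetical, and it assumes entries are added in increasing offset order.

```java
import java.util.Arrays;

// Hypothetical sketch: offset correction via binary search over parallel
// int arrays, so no object is allocated per added entry.
final class OffsetCorrectionSketch {
    private int[] offsets = new int[8];
    private int[] cumulativeDiffs = new int[8];
    private int size = 0;

    // Assumption: offsets arrive in increasing order.
    void add(int off, int cumulativeDiff) {
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            cumulativeDiffs = Arrays.copyOf(cumulativeDiffs, size * 2);
        }
        offsets[size] = off;
        cumulativeDiffs[size] = cumulativeDiff;
        size++;
    }

    // Find the last entry whose offset is <= currentOff and apply its diff.
    int correct(int currentOff) {
        int lo = 0;
        int hi = size - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= currentOff) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? currentOff : currentOff + cumulativeDiffs[found];
    }
}
```

Each lookup costs O(log n) in the number of recorded corrections. Whether that is faster in practice would still need the benchmark mentioned above.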

> make BaseCharFilter more efficient in performance
> -
>
> Key: LUCENE-2098
> URL: https://issues.apache.org/jira/browse/LUCENE-2098
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.9
>Reporter: Koji Sekiguchi
>Priority: Minor
> Attachments: LUCENE-2098.patch
>
>
> Performance degradation in Solr 1.4 was reported. See:
> http://www.lucidimagination.com/search/document/43c4bdaf5c9ec98d/html_stripping_slower_in_solr_1_4
> The inefficiency has been pointed out in BaseCharFilter javadoc by Mike:
> {panel}
> NOTE: This class is not particularly efficient. For example, a new class 
> instance is created for every call to addOffCorrectMap(int, int), which is 
> then appended to a private list. 
> {panel}


Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-15 Thread Ryan McKinley

>
> Personally I'd prefer we just stop adding them, and the current ones work
> their way up like normal if they are so inclined, or the ones that are not
> even around anymore can just stay as they are.
>

This seems reasonable to me.


[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-15 Thread Earwin Burrfoot (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845530#action_12845530
 ] 

Earwin Burrfoot commented on LUCENE-2320:
-

We could split MergePolicy in two - class that represents the policy 
(config/factory) and class that acts on that policy (instance).

So IW gets a MergePolicy that has no outside references, and creates a 
MergePoliceman from it, supplying 'this' on construction.
Thus, circular reference still exists, but is contained for good.

Not sure I totally love the idea myself though.
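
A minimal sketch of that split, with hypothetical class names (MergePolicySketch, MergePolicemanSketch, IndexWriterSketch) and a made-up mergeFactor setting standing in for real configuration:

```java
// Hypothetical sketch of the config/instance split. The policy holds only
// settings and has no reference to any writer.
final class MergePolicySketch {
    final int mergeFactor; // illustrative setting only

    MergePolicySketch(int mergeFactor) {
        this.mergeFactor = mergeFactor;
    }

    // Factory method: binds the settings to one specific writer.
    MergePolicemanSketch createFor(IndexWriterSketch writer) {
        return new MergePolicemanSketch(this, writer);
    }
}

// The instance that acts on the policy; it is the only place where the
// writer <-> merge logic circular reference lives.
final class MergePolicemanSketch {
    final MergePolicySketch policy;
    final IndexWriterSketch writer;

    MergePolicemanSketch(MergePolicySketch policy, IndexWriterSketch writer) {
        this.policy = policy;
        this.writer = writer;
    }
}

final class IndexWriterSketch {
    private final MergePolicemanSketch policeman;

    IndexWriterSketch(MergePolicySketch policy) {
        // 'this' is supplied on construction, as described above.
        this.policeman = policy.createFor(this);
    }

    MergePolicemanSketch policeman() {
        return policeman;
    }
}
```

Because the policy object never sees a writer, the same policy can be handed to several writers, and it can be created before any writer exists.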

> Add MergePolicy to IndexWriterConfig
> 
>
> Key: LUCENE-2320
> URL: https://issues.apache.org/jira/browse/LUCENE-2320
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2320.patch
>
>
> Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
> well. The change is not straightforward and so I've kept it for a separate 
> issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
> passed to it before an IndexWriter actually exists. And today IW may create 
> an MP just for it to be overridden by the application one line afterwards. I 
> don't want to make iw member of MP non-final, or settable by extending 
> classes, however it needs to remain protected so they can access it directly. 
> So the proposed changes are:
> * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
> once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
> set(T)* and *T get()*. T will be declared volatile, so that get() won't be 
> synchronized.
> * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
> the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
> * MP will offer a public default ctor, together with a set(IndexWriter).
> * IndexWriter will set itself on MP using set(this). Note that if set will be 
> called more than once, it will throw an exception (AlreadySetException - or 
> does someone have a better suggestion, preferably an already existing Java 
> exception?).
> That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
> review and proposals.
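
The SetOnce idea described above could look roughly like the following. This is a sketch under the stated design (synchronized set, volatile value, unsynchronized get), not the code from LUCENE-2320.patch; the class name SetOnceSketch is hypothetical, and AlreadySetException is only the name floated in the description.

```java
// Hypothetical sketch of a holder whose value can be assigned exactly once.
final class SetOnceSketch<T> {

    // Name taken from the proposal; extending IllegalStateException is an
    // assumption made here so callers can also catch the standard type.
    static final class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice!");
        }
    }

    private volatile T value;    // volatile so get() needs no lock
    private boolean set = false; // guarded by the monitor held in set()

    // Synchronized so that two racing callers cannot both succeed.
    synchronized void set(T newValue) {
        if (set) {
            throw new AlreadySetException();
        }
        value = newValue;
        set = true;
    }

    // Returns null until set() has been called.
    T get() {
        return value;
    }
}
```

Tracking a separate boolean rather than testing value against null means that even set(null) counts as the one permitted assignment.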


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845503#action_12845503
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Also wanted to add that the PostingList lastDocID is correct.


Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-15 Thread DM Smith

My 2 cents as one who has no aspirations of ever being a committer.

I think with the pending re-org of contrib and the value of contrib, it
doesn't make much sense to have the distinction between core and contrib
let alone for contributors.

Regarding the former low bar, either prune the list (voluntarily or
forcefully), prune individuals when they commit something they really,
really shouldn't have (e.g. no discussion, no consensus), or give
several opportunities to do right then prune.

But in any case, spell out the expectations and document it (perhaps in
the wiki).

I think it can work and there will be little if any problem with it.

-- DM

On 03/15/2010 02:33 PM, Grant Ingersoll wrote:

On Mar 15, 2010, at 1:25 PM, Mark Miller wrote:

On 03/15/2010 08:33 AM, Grant Ingersoll wrote:

Right, Mark. I think we would be effectively raising the bar to some extent
for what it takes to be a committer.

That's part of my point though - some are contrib committers with a lower bar -
now they are core/solr committers with that lower bar, but someone else that
came along would not get to the same position now?

I think they may just have a little more work to do, either that or maybe we
just have a little more faith that the right things will be done.

We'd also be making contrib a first class citizen (not that it ever wasn't,
but some people have that perception).

I think because it was kind of true. I could come along before and donate
contrib x, and never show I worked well with the community or build up the
merit needed to be a committer, and be made a contrib committer simply to
maintain my module. That's happened plenty.

True. I guess what I'm saying is we can still make them committers and it may be that they still only will
work on "their" module, but we should base our vote on them being "full" committers. I
don't like the notion of modules belonging to someone (not that you were implying that, I know.) I guess I
just see it as you either have earned merit or not. That's how we do it in Solr and Mahout and they both
have modules/contribs and it also fits more with the notion of "one project, one set of committers".

Finally, I think we need to recognize that not everyone needs to be a
McCandless in order to contribute in a helpful way.

We obviously recognize that or else I wouldn't be here! I think it's more about
fitting in - showing you get and follow the Apache way. Showing that ideas and
changes you might push are in line with what the other committers think is
appropriate of a core/solr committer. Talent is not key here - community is.
The bar for this has been *much* higher core than contrib in the past. And
contrib has had different bars over time - I think it was even lower in the
past at points.

Agreed.

I think sometimes we forget that you can do svn revert.

I hate to have to do that. I don't think it's a great way to handle this - we
could make everyone a committer at a drop of a hat and say we can just revert.
I wouldn't call for a revert except in exceptional circumstances. I don't think
that's the point.

Right, obviously I wasn't implying we'd want to do it, but we can if it is
absolutely necessary.

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845493#action_12845493
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

{quote}Ahh, I think it's because you're not calling
compactPostings/sortPostings in the THPF, right?

Those methods collapse the hash table in-place (ie move all the
nulls out), and sort.{quote}

Yep, got that part. 

{quote}So you have to re-work the code to not do that and
instead use whatever structure you have for visiting terms in
sorted order. Then stepping through the docs should just work,
but, you gotta stop at the max docID, right?{quote}

Right, the terms in sorted order is working... The freq
ByteSliceReader is reading nothing however (zeroes). Either it's
init'ed to the wrong position, or there's nothing in there? Or
something else. 


[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-15 Thread Shai Erera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845490#action_12845490
 ] 

Shai Erera commented on LUCENE-2320:


The thing is that we were at that position already, before I changed it so that 
MP requires writer up front. The reason was, like Mike mentioned, that writer 
had to be passed on all method calls, for really no good reason. A MP is 
usually coupled w/ an IW instance and I don't think we should opt for 
decoupling them.

Most of this patch moves MP setting from IW to IWC (and hence changes test 
code to use the new API). The SetOnce juggling is done only to ensure an IW is 
set exactly once on MP, and allows us to resolve that circular dependency. We 
can do two things:
# Continue w/ SetOnce as introduced in this patch.
# Introduce a setIndexWriter on MP which anyone can call, even more than once.

With (1) I don't think we complicate anything, and SetOnce can be useful in 
other places as well. (2) is really like passing writer on all method calls, so 
let's at least not have it as part of every method's signature. I prefer (1) 
slightly over (2) but am fine w/ (2) as well. I wouldn't want to change MP back 
to require IW on all its methods.


[jira] Resolved: (LUCENE-2325) investigate solr test failures using flex


 [ 
https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2325.


Resolution: Fixed

Solr can now run on flex :)

> investigate solr test failures using flex
> -
>
> Key: LUCENE-2325
> URL: https://issues.apache.org/jira/browse/LUCENE-2325
> Project: Lucene - Java
> Issue Type: Test
> Affects Versions: Flex Branch
> Reporter: Robert Muir
> Assignee: Michael McCandless
> Fix For: Flex Branch
>
> Attachments: LUCENE-2325.patch, LUCENE-2325.patch
>
>
> We have a branch of Solr located here: 
> https://svn.apache.org/repos/asf/lucene/solr/branches/solr
> Currently all the tests pass with lucene trunk jars.
> I plopped in the flex jars and they do not, so I thought these might be 
> interesting to look at.


[jira] Updated: (LUCENE-2325) investigate solr test failures using flex


 [ 
https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2325:
---

Attachment: LUCENE-2325.patch

The bug was... if you asked for TermsEnum on a non-existent field on a foreign 
IndexReader (like Solr's, SolrIndexReader), so that the "emulate flex API on 
top of non-flex API" layer is used, then the returned TermsEnum would 
incorrectly return 1 term, and then null, when it should've returned null right 
off.

I'll commit shortly -- simple fix.

With this all Solr's tests pass when you drop in the flex JARs!!  Yay.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845464#action_12845464
 ] 

Michael McCandless commented on LUCENE-2312:


Ahh, I think it's because you're not calling compactPostings/sortPostings in 
the THPF, right?

Those methods collapse the hash table in-place (ie move all the nulls out), and 
sort.
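
As an illustration of what "collapse in place, then sort" means, here is a minimal sketch on a plain String[] standing in for the postings hash. The class and method names are hypothetical and this is not the actual TermsHashPerField code.

```java
import java.util.Arrays;

final class CompactAndSort {
  /** Moves all non-null entries to the front, in place; returns how many there are. */
  static int compact(String[] table) {
    int upto = 0;
    for (int i = 0; i < table.length; i++) {
      if (table[i] != null) {
        if (upto < i) {
          table[upto] = table[i];
          table[i] = null;
        }
        upto++;
      }
    }
    return upto;
  }

  /** Compacts the table, then sorts only the occupied prefix. */
  static int compactAndSort(String[] table) {
    int count = compact(table);
    Arrays.sort(table, 0, count);
    return count;
  }
}
```

Note that after compaction the entries no longer sit at their hash slots, so the array can no longer serve lookups. That is presumably why this step cannot simply be run on a buffer that is still being indexed into.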

So you have to re-work the code to not do that and instead use whatever 
structure you have for visiting terms in sorted order.  Then stepping through 
the docs should just work, but, you gotta stop at the max docID, right?

Hmm... what does JMM say about byte arrays?  If one thread is writing to the 
byte array, can any other thread see those changes?
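
For reference on the JMM question: elements of a byte[] are ordinary (non-volatile) variables, so a reader is only guaranteed to see another thread's writes if there is a happens-before edge between them, for example a volatile write by the writer followed by a volatile read of the same field by the reader. A minimal sketch of that pattern, with a hypothetical class name and a single writer thread assumed:

```java
final class PublishedBytes {
  private final byte[] buffer;
  // Number of bytes readers may look at. The volatile write/read pair creates
  // the happens-before edge; the array elements themselves are plain variables.
  private volatile int published;
  // Touched by the single writer thread only.
  private int written;

  PublishedBytes(int capacity) {
    buffer = new byte[capacity];
  }

  /** Writer thread only: stores a byte without making it visible yet. */
  void append(byte b) {
    buffer[written++] = b;
  }

  /** Writer thread only: makes everything appended so far visible to readers. */
  void publish() {
    published = written;
  }

  /** Any thread: number of bytes that are safe to read. */
  int limit() {
    return published;
  }

  /** Any thread: reads a byte below a limit previously obtained from limit(). */
  byte get(int index, int limit) {
    if (index < 0 || index >= limit) {
      throw new IndexOutOfBoundsException("index " + index + ", limit " + limit);
    }
    return buffer[index];
  }
}
```

A reader that calls limit() once and then only reads below that value sees a consistent prefix of the buffer, which is similar in spirit to stopping at a max docID.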

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845448#action_12845448
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

The code is from FreqProxFieldMergeState, which accepts a
FreqProxTermsWriterPerField in its constructor. One difference is
that instead of operating on an array of posting lists, the code
I posted assumes one posting list.

The numPostings was always 0 when testing with
{code}this.numPostings = field.termsHashPerField.numPostings;{code} 
so in the posted code it's hard-coded to 1. 

Maybe there's some initialization that's not happening correctly?




Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-15 Thread Grant Ingersoll

On Mar 15, 2010, at 1:25 PM, Mark Miller wrote:

> On 03/15/2010 08:33 AM, Grant Ingersoll wrote:
>> Right, Mark.  I think we would be effectively raising the bar to some extent 
>> for what it takes to be a committer.
> 
> That's part of my point though - some are contrib committers with a lower bar 
> - now they are core/solr committers with that lower bar, but someone else 
> that came along would not get to the same position now?

I think they may just have a little more work to do, either that or maybe we 
just have a little more faith that the right things will be done.

> 
>>  We'd also be making contrib a first class citizen (not that it ever wasn't, 
>> but some people have that perception).
> 
> I think because it was kind of true. I could come along before and donate 
> contrib x, and never show I worked well with the community or build up the 
> merit needed to be a committer, and be made a contrib committer simply to 
> maintain my module. That's happened plenty.

True.  I guess what I'm saying is we can still make them committers and it may 
be that they still only will work on "their" module, but we should base our 
vote on them being "full" committers.  I don't like the notion of modules 
belonging to someone (not that you were implying that, I know.)  I guess I just 
see it as you either have earned merit or not.  That's how we do it in Solr and 
Mahout and they both have modules/contribs and it also fits more with the 
notion of "one project, one set of committers".

> 
>>  Finally, I think we need to recognize that not everyone needs to be a 
>> McCandless in order to contribute in a helpful way.
> 
> We obviously recognize that or else I wouldn't be here! I think it's more 
> about fitting in - showing you get and follow the Apache way. Showing that 
> ideas and changes you might push are in line with what the other committers 
> think is appropriate of a core/solr committer. Talent is not key here - 
> community is. The bar for this has been *much* higher for core than contrib in 
> the past. And contrib has had different bars over time - I think it was even 
> lower in the past at points.

Agreed.

> 
>>  I think sometimes we forget that you can do svn revert.
> 
> I hate to have to do that. I don't think it's a great way to handle this - we 
> could make everyone a committer at the drop of a hat and say we can just 
> revert. I wouldn't call for a revert except in exceptional circumstances. I 
> don't think that's the point.

Right, obviously I wasn't implying we'd want to do it, but we can if it is 
absolutely necessary.

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845432#action_12845432
 ] 

Michael McCandless commented on LUCENE-2312:


I don't see anything obviously wrong -- you excised this code from the same 
code that's used when merging the postings during flush?


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845428#action_12845428
 ] 

Michael McCandless commented on LUCENE-2324:


{quote}
bq. Seems like it's 8 bytes

Object header is two words, so that's 16bytes for 64bit arch. (probably 12 for 
64bit+CompressedOops?)
{quote}

Right, and the pointer'd also be 8 bytes (but compact int stays at 4
bytes) so net/net on 64bit JRE savings would be 16-20 bytes per term.

Another thing we could do if we cutover to parallel arrays is to
switch to packed ints.  Many of these fields are horribly wasteful as
ints, eg docFreq or lastPosition.
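
To make the packed-ints idea concrete, here is a minimal sketch of a fixed-width packed array backed by a long[]. The class name PackedIntArray and its methods are hypothetical and are not Lucene's packed ints implementation; the point is only that a field whose values fit in a few bits does not need a full 32-bit int per entry.

```java
final class PackedIntArray {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask;

  PackedIntArray(int valueCount, int bitsPerValue) {
    if (bitsPerValue < 1 || bitsPerValue > 32) {
      throw new IllegalArgumentException("bitsPerValue must be in 1..32");
    }
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
    long totalBits = (long) valueCount * bitsPerValue;
    this.blocks = new long[(int) ((totalBits + 63) / 64)];
  }

  /** Reads the value at index; it may straddle two adjacent 64-bit blocks. */
  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    int bitsInFirst = 64 - shift;
    if (bitsInFirst < bitsPerValue) {
      value |= blocks[block + 1] << bitsInFirst;
    }
    return value & mask;
  }

  /** Writes a value that must fit in bitsPerValue bits. */
  void set(int index, long value) {
    if ((value & ~mask) != 0) {
      throw new IllegalArgumentException("value does not fit in " + bitsPerValue + " bits");
    }
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
    int bitsInFirst = 64 - shift;
    if (bitsInFirst < bitsPerValue) {
      blocks[block + 1] =
          (blocks[block + 1] & ~(mask >>> bitsInFirst)) | (value >>> bitsInFirst);
    }
  }

  /** Bytes used by the backing array (ignoring the array header). */
  long ramBytes() {
    return 8L * blocks.length;
  }
}
```

For 100 values at 5 bits each this uses 8 longs (64 bytes) versus 400 bytes for an int[100], at the cost of extra shifting and masking on every access.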


> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845426#action_12845426
 ] 

Michael McCandless commented on LUCENE-2324:


bq. Hmm I think we'd need a separate hash. Otherwise you have to subclass 
PostingList for the different cases (freq. vs. non-frequent terms) and do 
instanceof checks? Or with the parallel arrays idea maybe we could encode more 
information in the dense ID? E.g. use one bit to indicate if that term occurred 
more than once.

Or 2 sets of parallel arrays (one for the singletons) or, something.
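
The "one bit in the dense ID" variant quoted above could be sketched as follows. The class and method names are hypothetical, and using the low bit (which halves the usable ID range) is just one possible layout.

```java
final class TermIdCodec {
  /** Packs a dense term ID and a "seen more than once" flag into one int. */
  static int encode(int denseId, boolean seenMoreThanOnce) {
    if (denseId < 0 || denseId > (Integer.MAX_VALUE >>> 1)) {
      throw new IllegalArgumentException("denseId out of range: " + denseId);
    }
    return (denseId << 1) | (seenMoreThanOnce ? 1 : 0);
  }

  /** Recovers the dense ID, e.g. for indexing into the parallel arrays. */
  static int denseId(int encoded) {
    return encoded >>> 1;
  }

  /** True if the term has left the singleton state. */
  static boolean seenMoreThanOnce(int encoded) {
    return (encoded & 1) != 0;
  }
}
```

The hash would then store the encoded int, and the flag decides which set of arrays (singleton or full posting) the ID refers to, with no instanceof check.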

bq. So in a 32Bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 
bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth 
it?

And also the GC cost.

But it seems like specializing singleton fields will be the bigger win.

bq. I was wondering if it makes sense to make these kinds of experiments 
(pooling vs. non-pooling) with the flex code?

Last I tested (a while back now) indexing perf was the same -- need to
test again w/ recent changes (eg terms index is switching to packed
ints).  For pooling vs not I'd just do the experiment on trunk?

And most of this change (changing how postings data is buffered in
RAM) is "above" flex I expect.

But if for some reason you need to start changing index postings
format then you should probably do that on flex.



Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-15 Thread Mark Miller


On 03/15/2010 08:33 AM, Grant Ingersoll wrote:

> Right, Mark.  I think we would be effectively raising the bar to some extent 
> for what it takes to be a committer.

That's part of my point though - some are contrib committers with a 
lower bar - now they are core/solr committers with that lower bar, but 
someone else that came along would not get to the same position now?

> We'd also be making contrib a first class citizen (not that it ever wasn't, 
> but some people have that perception).

I think because it was kind of true. I could come along before and 
donate contrib x, and never show I worked well with the community or 
build up the merit needed to be a committer, and be made a contrib 
committer simply to maintain my module. That's happened plenty.

> Finally, I think we need to recognize that not everyone needs to be a 
> McCandless in order to contribute in a helpful way.

We obviously recognize that or else I wouldn't be here! I think it's more 
about fitting in - showing you get and follow the Apache way. Showing 
that ideas and changes you might push are in line with what the other 
committers think is appropriate of a core/solr committer. Talent is not 
key here - community is. The bar for this has been *much* higher for core 
than contrib in the past. And contrib has had different bars over time - 
I think it was even lower in the past at points.

> I think sometimes we forget that you can do svn revert.

I hate to have to do that. I don't think it's a great way to handle this 
- we could make everyone a committer at the drop of a hat and say we can 
just revert. I wouldn't call for a revert except in exceptional 
circumstances. I don't think that's the point.

> Obviously, we don't want to have to do it often, but it's not a huge deal if it 
> happens.  We've all been there.
>
> -Grant

   


I also wouldn't personally cast my vote on this broadly - some people I 
might think should be core/solr committers now, others not. Merit at 
Apache is important - you never lose it. Seems weird to get something 
like that so easily when in the past you had to work your way to it from 
contrib committership and get voted on individually by the PMC.


Personally I'd prefer we just stop adding them, and the current ones 
work their way up like normal if they are so inclined, or the ones that 
are not even around anymore can just stay as they are.


--
- Mark

http://www.lucidimagination.com





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-15 Thread Earwin Burrfoot (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845408#action_12845408
 ] 

Earwin Burrfoot commented on LUCENE-2324:
-

> Seems like it's 8 bytes
Object header is two words, so that's 16 bytes for 64bit arch. (probably 12 for 
64bit+CompressedOops?)

Also, GC time is (roughly) linear in the number of objects on the heap, so 
replacing a single huge array of objects with a few huge primitive arrays for 
their fields does wonders for your GC delays.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845404#action_12845404
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Apologies in advance for permanently damaging (well, I guess it
can be deleted) the look and feel of this issue with a thwack of
code. I don't want to post the messy patch, and I'm guessing
there's something small behind why the postings iteration on the
freq byte slice reader isn't happening correctly (ie, it's
returning 0).

{code}
public class DWTermDocs implements TermDocs {
  final FreqProxTermsWriterPerField field;
  final int numPostings;
  final CharBlockPool charPool;
  FreqProxTermsWriter.PostingList posting;
  char[] text;
  int textOffset;
  private int postingUpto = -1;
  final ByteSliceReader freq = new ByteSliceReader();
  final ByteSliceReader prox = new ByteSliceReader();

  int docID;
  int termFreq;

  DWTermDocs(FreqProxTermsWriterPerField field,
             FreqProxTermsWriter.PostingList posting) throws IOException {
    this.field = field;
    this.charPool = field.perThread.termsHashPerThread.charPool;
    //this.numPostings = field.termsHashPerField.numPostings;
    this.numPostings = 1;
    this.posting = posting;
    // nextTerm is called only once to
    // set the term docs pointer at the
    // correct position
    nextTerm();
  }

  boolean nextTerm() throws IOException {
    postingUpto++;
    if (postingUpto == numPostings)
      return false;

    docID = 0;

    text = charPool.buffers[posting.textStart >> DocumentsWriter.CHAR_BLOCK_SHIFT];
    textOffset = posting.textStart & DocumentsWriter.CHAR_BLOCK_MASK;

    // Stream 0 holds doc/freq data, stream 1 holds positions
    field.termsHashPerField.initReader(freq, posting, 0);
    if (!field.fieldInfo.omitTermFreqAndPositions)
      field.termsHashPerField.initReader(prox, posting, 1);

    // Should always be true
    boolean result = nextDoc();
    assert result;

    return true;
  }

  public boolean nextDoc() throws IOException {
    if (freq.eof()) {
      if (posting.lastDocCode != -1) {
        // Return last doc
        docID = posting.lastDocID;
        if (!field.omitTermFreqAndPositions)
          termFreq = posting.docFreq;
        posting.lastDocCode = -1;
        return true;
      } else
        // EOF
        return false;
    }
    final int code = freq.readVInt();
    if (field.omitTermFreqAndPositions)
      docID += code;
    else {
      // Low bit set means termFreq == 1; otherwise the freq follows as a VInt
      docID += code >>> 1;
      if ((code & 1) != 0)
        termFreq = 1;
      else
        termFreq = freq.readVInt();
    }
    assert docID != posting.lastDocID;
    return true;
  }
  // (the remaining TermDocs methods are not part of the posted excerpt)
}
{code}
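
For readers unfamiliar with the encoding that nextDoc() decodes, here is a self-contained sketch of the doc-delta/freq scheme on a plain byte array. DocFreqCodec and its methods are hypothetical names, and the sketch leaves out the special case seen in the posted code, where the most recent doc is held in lastDocID/lastDocCode rather than in the byte stream.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class DocFreqCodec {
  /** Writes an int as 7-bit groups, low group first, high bit marking "more bytes follow". */
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  /** Reads one VInt starting at pos[0] and advances pos[0] past it. */
  static int readVInt(byte[] bytes, int[] pos) {
    byte b = bytes[pos[0]++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = bytes[pos[0]++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  /** Encodes ascending docIDs with their term frequencies. */
  static byte[] encode(int[] docIDs, int[] freqs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int lastDocID = 0;
    for (int i = 0; i < docIDs.length; i++) {
      int delta = docIDs[i] - lastDocID;
      if (freqs[i] == 1) {
        writeVInt(out, (delta << 1) | 1); // low bit set: freq is 1, no extra VInt
      } else {
        writeVInt(out, delta << 1);       // low bit clear: freq follows
        writeVInt(out, freqs[i]);
      }
      lastDocID = docIDs[i];
    }
    return out.toByteArray();
  }

  /** Decodes into {docID, termFreq} pairs, mirroring the loop body of nextDoc(). */
  static List<int[]> decode(byte[] bytes) {
    List<int[]> postings = new ArrayList<int[]>();
    int[] pos = {0};
    int docID = 0;
    while (pos[0] < bytes.length) {
      int code = readVInt(bytes, pos);
      docID += code >>> 1;
      int termFreq = (code & 1) != 0 ? 1 : readVInt(bytes, pos);
      postings.add(new int[] {docID, termFreq});
    }
    return postings;
  }
}
```

If a reader over such a stream returns nothing, one thing worth checking is whether the stream was ever given a start and end offset covering the written bytes, which is what initReader is responsible for in the posted code.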


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845400#action_12845400
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
Sounds great - let's test it in practice.
{quote}

I have to admit that I need to catch up a bit on the flex branch.  I was 
wondering if it makes sense to make these kinds of experiments (pooling vs. 
non-pooling) with the flex code? Is it as fast as trunk already, or are there 
related nocommits left that affect indexing performance?  I would think not 
much of the flex changes should affect the in-memory indexing performance (in 
TermsHash*).



[jira] Issue Comment Edited: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398
 ] 

Michael Busch edited comment on LUCENE-2324 at 3/15/10 4:34 PM:


Reply to Mike's comment on LUCENE-2293: 
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263


{quote}
I think we can do even better, ie, that class wastes RAM for the single posting 
case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are 
not needed).

EG we could have a separate class dedicated to the singleton case. When term is 
first encountered it's enrolled there. We'd probably need a separate hash to 
store these (though not necessarily?). If it's seen again it's switched to the 
full posting.
{quote}

Hmm I think we'd need a separate hash.  Otherwise you have to subclass 
PostingList for the different cases (freq. vs. non-frequent terms) and do 
instanceof checks? Or with the parallel arrays idea maybe we could encode more 
information in the dense ID? E.g. use one bit to indicate if that term occurred 
more than once. 

{quote}
I mean instead of allocating an instance per unique term, we assign an integer 
ID (dense, ie, 0, 1, 2...).

And then we have an array for each member now in 
FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. 
Then to look up say the lastDocID for a given postingID you just get 
lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we 
can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I 
had a very compact Node class with no data but such a dense ID, and arrays that 
stored the associated data.  Very easy to add another data type with no RAM 
overhead (you only use the amount of RAM the data needs).
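
A minimal sketch of the parallel-arrays layout discussed above, assuming just two of the PostingList members (docFreq and lastDocID) and simple doubling growth. The class name and growth policy are illustrative, not what any patch does.

```java
import java.util.Arrays;

final class ParallelPostings {
  // One array per former PostingList member, indexed by the dense posting ID.
  int[] docFreqs = new int[4];
  int[] lastDocIDs = new int[4];
  private int count;

  /** Allocates the next dense ID (0, 1, 2...), growing all arrays together. */
  int newPosting() {
    if (count == docFreqs.length) {
      int newSize = count * 2;
      docFreqs = Arrays.copyOf(docFreqs, newSize);
      lastDocIDs = Arrays.copyOf(lastDocIDs, newSize);
    }
    return count++;
  }

  /** Number of postings allocated so far. */
  int size() {
    return count;
  }
}
```

Looking up a member is then lastDocIDs[postingID] instead of posting.lastDocID: no per-term object header and no per-term pointer, but one array dereference per member accessed.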

Though the price you pay is an extra dereference for each array you access?  
And how much RAM would we save? The pointer for the PostingList object (4-8 
bytes), plus the size of the object header - how much is that in Java? 

Seems like it's 8 bytes: 
http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32Bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes 
(ID) = 8 bytes.  For fields with tons of unique terms that might be worth it?  


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845398#action_12845398
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
I think we can do even better, ie, that class wastes RAM for the single posting 
case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are 
not needed).

EG we could have a separate class dedicated to the singleton case. When term is 
first encountered it's enrolled there. We'd probably need a separate hash to 
store these (though not necessarily?). If it's seen again it's switched to the 
full posting.
{quote}

Hmm I think we'd need a separate hash.  Otherwise you have to subclass 
PostingList for the different cases (freq. vs. non-frequent terms) and do 
instanceof checks? Or with the parallel arrays idea maybe we could encode more 
information in the dense ID? E.g. use one bit to indicate if that term occurred 
more than once. 
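
To make the "one bit in the dense ID" idea concrete, here is a minimal sketch. 
The class PostingRef and all of its method names are made up for illustration 
(nothing like this exists in Lucene as far as this thread says); it assumes a 
31-bit dense ID with the top bit of the int used as the "occurred more than 
once" flag.

```java
// Hypothetical encoding (assumption, not Lucene code): the top bit of the int
// marks "term occurred more than once", the lower 31 bits hold the dense ID.
final class PostingRef {
    private static final int MULTI_FLAG = 1 << 31;

    private PostingRef() {}

    /** Reference for a term seen exactly once so far. */
    static int singleton(int denseId) {
        if (denseId < 0) {
            throw new IllegalArgumentException("dense ID must be non-negative");
        }
        return denseId;
    }

    /** Marks the term as having occurred more than once. */
    static int promote(int ref) {
        return ref | MULTI_FLAG;
    }

    static boolean isMulti(int ref) {
        return (ref & MULTI_FLAG) != 0;
    }

    /** Strips the flag bit and returns the dense ID. */
    static int denseId(int ref) {
        return ref & ~MULTI_FLAG;
    }
}
```

With something like this the hash could keep a single int per term and there 
would be no need for subclassing or instanceof checks; whether the extra 
masking is cheaper than a second hash would have to be measured.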

{quote}
I mean instead of allocating an instance per unique term, we assign an integer 
ID (dense, ie, 0, 1, 2...).

And then we have an array for each member now in 
FreqProxTermsWriter.PostingList, ie int[] docFreqs, int [] lastDocIDs, etc. 
Then to look up say the lastDocID for a given postingID you just get 
lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we 
can make these arrays paged... but that'd slow down each access.
{quote}

Yeah I like that idea. I've done something similar for representing trees - I 
had a very compact Node class with no data but such a dense ID, and arrays that 
stored the associated data.  Very easy to add another data type with no RAM 
overhead (you only use the amount of RAM the data needs).

Though, the price you pay is dereferencing multiple times, once for each array?  
And how much RAM would we save? The pointer for the PostingList object (4-8 
bytes), plus the size of the object header - how much is that in Java? 

Seems like it's 8 bytes: 
http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes 
(ID) = 8 bytes.  For fields with tons of unique terms that might be worth it?  
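
A rough sketch of the parallel-arrays layout, just to illustrate the idea. 
ParallelPostings and its methods are hypothetical names, only two of the 
PostingList members are modelled, and the arrays are plain (not paged). The 
last method simply repeats the 32-bit arithmetic from above.

```java
import java.util.Arrays;

// Hypothetical sketch (not Lucene code): per-term state lives in parallel int
// arrays addressed by a dense posting ID, instead of one object per term.
final class ParallelPostings {
    private int[] docFreqs;
    private int[] lastDocIDs;
    private int count; // number of posting IDs handed out so far

    ParallelPostings(int initialCapacity) {
        docFreqs = new int[Math.max(1, initialCapacity)];
        lastDocIDs = new int[docFreqs.length];
    }

    /** Allocates the next dense ID (0, 1, 2, ...), growing the arrays if needed. */
    int newPosting() {
        if (count == docFreqs.length) {
            int newSize = docFreqs.length * 2;
            docFreqs = Arrays.copyOf(docFreqs, newSize);
            lastDocIDs = Arrays.copyOf(lastDocIDs, newSize);
        }
        lastDocIDs[count] = -1; // no document seen yet for this term
        return count++;
    }

    /** Records an occurrence of the term in docID; each document counts once. */
    void addOccurrence(int postingID, int docID) {
        if (lastDocIDs[postingID] != docID) {
            docFreqs[postingID]++;
            lastDocIDs[postingID] = docID;
        }
    }

    int docFreq(int postingID) {
        return docFreqs[postingID];
    }

    int lastDocID(int postingID) {
        return lastDocIDs[postingID];
    }

    /** Estimated saving per term on a 32-bit JVM, using the figures above. */
    static int bytesSavedPerTerm32Bit() {
        int pointer = 4;
        int header = 8;
        int denseId = 4;
        return pointer + header - denseId;
    }
}
```

Adding another per-term field would just mean adding another array, which is 
the "no RAM overhead" property mentioned above; the cost is one array 
dereference per field that is touched.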

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency


[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845391#action_12845391
 ] 

Michael Busch commented on LUCENE-2293:
---

I'll reply on LUCENE-2324.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
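
The private-docID-stream idea in the description above can be pictured with a 
small model. This is only a sketch under assumptions: PrivateSegment and 
DocIdBases are invented names, not DocumentsWriter code, and it assumes global 
docIDs are formed by adding a per-segment base to the segment-local docID.

```java
// Hypothetical model (not DocumentsWriter code): every thread state owns a
// private segment with its own docID stream that starts at 0.
final class PrivateSegment {
    private int nextDocID;

    /** Adds a document and returns its segment-local docID. */
    int addDocument() {
        return nextDocID++;
    }

    int docCount() {
        return nextDocID;
    }
}

final class DocIdBases {
    private DocIdBases() {}

    /** Returns base[i] such that global docID = base[i] + local docID in segment i. */
    static int[] bases(int[] docCounts) {
        int[] bases = new int[docCounts.length];
        int sum = 0;
        for (int i = 0; i < docCounts.length; i++) {
            bases[i] = sum;
            sum += docCounts[i];
        }
        return bases;
    }
}
```

Within one segment the docIDs stay monotonic, but across segments they no 
longer follow arrival time, which is the downside noted in the description.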


Re: Welcome new committers!

2010-03-15 Thread Michael Busch


Welcome guys! :)

Sounds really like some great progress in such a short time!

 Michael

On 3/15/10 8:25 AM, Michael McCandless wrote:

The merge of Solr and Lucene dev is well underway... Lucene already
has a bunch of new committers... welcome aboard!

And overnight tons of work was done (and beer, espresso and tea,
depending on your timezone, consumed ;) and now we already
have a branch where Solr has been upgraded to Lucene's trunk JARs:

   https://svn.apache.org/repos/asf/lucene/solr/branches/solr

Wonderfully, this then made testing the flex branch against Solr
simple, which Robert did, thus uncovering a couple back-compat issues
that otherwise would've remained hidden...

Great progress already!

Of course there's still much to do going forward...devil is in the
details, but it's great to have us all on the same team.

Mike


[jira] Assigned: (LUCENE-2311) Pass potent SR to IRWarmer.warm(), and also call warm() for new segments


 [ 
https://issues.apache.org/jira/browse/LUCENE-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2311:
--

Assignee: Michael McCandless

> Pass potent SR to IRWarmer.warm(), and also call warm() for new segments
> 
>
> Key: LUCENE-2311
> URL: https://issues.apache.org/jira/browse/LUCENE-2311
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Earwin Burrfoot
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Currently warm() receives a SegmentReader without terms index and docstores.
> It would be arguably more useful for the app to receive a fully loaded 
> reader, so it can actually fire up some caches. If the warmer is undefined on 
> IW, we probably leave things as they are.
> It is also arguably more concise and clear to call warm() on all newly 
> created segments, so there is a single point of warming readers in NRT 
> context, and every subreader coming from getReader is guaranteed to be warmed 
> up -> you don't have to introduce even more mess in your code by rechecking 
> it.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845374#action_12845374
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

{quote}Good question on skipping - for first cut we can have no
skipping (and just scan)? {quote}

True.

One immediate thought is to have a set skip interval (what was
it before when we had single level?), and for now at least have
a single level skip list. That way we can grow the posting list with
docs, and the skip list at the same time. If the interval is
constant there won't be a need to rebuild the skip list.
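
Roughly what I have in mind, as a sketch only. SkippingPostings is a made-up 
class, docIDs are kept as plain ints rather than the encoded byte slices IW 
really uses, and docIDs are assumed to arrive in strictly increasing order.

```java
import java.util.Arrays;

// Hypothetical in-RAM posting list with a single-level skip list at a fixed
// interval. Appending never rewrites existing skip entries.
final class SkippingPostings {
    private final int skipInterval;
    private int[] docs = new int[8];
    private int[] skipDocs = new int[2]; // skipDocs[k] == docs[k * skipInterval]
    private int size;
    private int skipCount;

    SkippingPostings(int skipInterval) {
        if (skipInterval < 1) {
            throw new IllegalArgumentException("skipInterval must be >= 1");
        }
        this.skipInterval = skipInterval;
    }

    /** Appends a docID; docIDs must be added in strictly increasing order. */
    void add(int docID) {
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2);
        }
        if (size % skipInterval == 0) { // first doc of a new block gets a skip entry
            if (skipCount == skipDocs.length) {
                skipDocs = Arrays.copyOf(skipDocs, skipCount * 2);
            }
            skipDocs[skipCount++] = docID;
        }
        docs[size++] = docID;
    }

    /** Returns the first docID >= target, or -1 if there is none. */
    int advance(int target) {
        int block = 0;
        // Skip whole blocks whose successor still starts at or before target.
        while (block + 1 < skipCount && skipDocs[block + 1] <= target) {
            block++;
        }
        for (int i = block * skipInterval; i < size; i++) {
            if (docs[i] >= target) {
                return docs[i];
            }
        }
        return -1;
    }

    int size() {
        return size;
    }
}
```

Because the interval is constant, a new skip entry is only ever appended when 
a block starts, so the skip list grows alongside the postings without being 
rebuilt.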

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig


[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845370#action_12845370
 ] 

Michael McCandless commented on LUCENE-2320:


bq. Or, maybe, we should think of MergePolicy API that doesn't require one to 
keep a reference to IW?

Looks like IW is used pretty widely: for messaging (when infoStream is set), 
for retrieving the merges, for getting the Directory, and for getting number of 
deleted docs for a given segment.  I guess an option would be to simply pass it 
around everywhere.  Then we wouldn't have to break the circular dependency.

This is what MergeScheduler appears to do -- it's passed to .merge, and then 
each bg thread in CMS holds a reference to the writer (since it needs to ask 
for followon merges).

> Add MergePolicy to IndexWriterConfig
> 
>
> Key: LUCENE-2320
> URL: https://issues.apache.org/jira/browse/LUCENE-2320
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2320.patch
>
>
> Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
> well. The change is not straightforward and so I've kept it for a separate 
> issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
> passed to it before an IndexWriter actually exists. And today IW may create 
> an MP just for it to be overridden by the application one line afterwards. I 
> don't want to make iw member of MP non-final, or settable by extending 
> classes, however it needs to remain protected so they can access it directly. 
> So the proposed changes are:
> * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
> once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
> set* and *T get()*. T will be declared volatile, so that get() won't be 
> synchronized.
> * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
> the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
> * MP will offer a public default ctor, together with a set(IndexWriter).
> * IndexWriter will set itself on MP using set(this). Note that if set will be 
> called more than once, it will throw an exception (AlreadySetException - or 
> does someone have a better suggestion, preferably an already existing Java 
> exception?).
> That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
> review and proposals.
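
For reference while reviewing, the SetOnce holder proposed in the description 
could look roughly like this. It is a sketch under assumptions: the extra 
boolean flag, the exception's superclass and its message are my guesses, and 
only the synchronized set, the unsynchronized get() and the volatile field 
come from the proposal itself.

```java
// Hypothetical sketch of the proposed holder; details beyond the proposal
// (the flag, the exception superclass and message) are assumptions.
final class SetOnce<T> {

    /** Thrown when set(T) is called a second time. */
    static final class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice!");
        }
    }

    private volatile T value; // volatile so get() needs no synchronization
    private boolean set;      // only touched inside the synchronized set(T)

    /** Sets the value; a second call throws AlreadySetException. */
    synchronized void set(T newValue) {
        if (set) {
            throw new AlreadySetException();
        }
        value = newValue;
        set = true;
    }

    /** Returns the value, or null if it has not been set yet. */
    T get() {
        return value;
    }
}
```

MP would then hold a final SetOnce<IndexWriter> field and IndexWriter would 
call set(this) once; extending IllegalStateException is one way to reuse an 
existing Java exception type, as asked in the description.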


Welcome new committers!

2010-03-15 Thread Michael McCandless

The merge of Solr and Lucene dev is well underway... Lucene already
has a bunch of new committers... welcome aboard!

And overnight tons of work was done (and beer, espresso and tea,
depending on your timezone, consumed ;) and now we already
have a branch where Solr has been upgraded to Lucene's trunk JARs:

  https://svn.apache.org/repos/asf/lucene/solr/branches/solr

Wonderfully, this then made testing the flex branch against Solr
simple, which Robert did, thus uncovering a couple back-compat issues
that otherwise would've remained hidden...

Great progress already!

Of course there's still much to do going forward...devil is in the
details, but it's great to have us all on the same team.

Mike


[jira] Updated: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling


 [ 
https://issues.apache.org/jira/browse/LUCENE-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2297:
---

Attachment: LUCENE-2297.patch

Adds IWC.set/getReaderPooling.

> IndexWriter should let you optionally enable reader pooling
> ---
>
> Key: LUCENE-2297
> URL: https://issues.apache.org/jira/browse/LUCENE-2297
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2297.patch
>
>
> For apps that use a large index and frequently need to commit and resolve 
> deletes, the cost of opening the SegmentReaders on demand for every commit 
> can be prohibitive.
> We can already pool readers (NRT does so), but we only turn it on if NRT 
> readers are in use.
> We should allow separate control.
> We should do this after LUCENE-2294.


[jira] Assigned: (LUCENE-2320) Add MergePolicy to IndexWriterConfig


 [ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2320:
--

Assignee: Michael McCandless

> Add MergePolicy to IndexWriterConfig
> 
>
> Key: LUCENE-2320
> URL: https://issues.apache.org/jira/browse/LUCENE-2320
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2320.patch
>
>
> Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
> well. The change is not straightforward and so I've kept it for a separate 
> issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
> passed to it before an IndexWriter actually exists. And today IW may create 
> an MP just for it to be overridden by the application one line afterwards. I 
> don't want to make iw member of MP non-final, or settable by extending 
> classes, however it needs to remain protected so they can access it directly. 
> So the proposed changes are:
> * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
> once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
> set* and *T get()*. T will be declared volatile, so that get() won't be 
> synchronized.
> * MP will define a *protected final SetOnce<IndexWriter> writer* instead of 
> the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
> * MP will offer a public default ctor, together with a set(IndexWriter).
> * IndexWriter will set itself on MP using set(this). Note that if set will be 
> called more than once, it will throw an exception (AlreadySetException - or 
> does someone have a better suggestion, preferably an already existing Java 
> exception?).
> That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
> review and proposals.


[jira] Commented: (LUCENE-2325) investigate solr test failures using flex


[ 
https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845332#action_12845332
 ] 

Michael McCandless commented on LUCENE-2325:


So awesome that we are at the point where we can do this!  Thanks Robert...

> investigate solr test failures using flex
> -
>
> Key: LUCENE-2325
> URL: https://issues.apache.org/jira/browse/LUCENE-2325
> Project: Lucene - Java
>  Issue Type: Test
>Affects Versions: Flex Branch
>Reporter: Robert Muir
>Assignee: Michael McCandless
> Fix For: Flex Branch
>
> Attachments: LUCENE-2325.patch
>
>
> We have a branch of Solr located here: 
> https://svn.apache.org/repos/asf/lucene/solr/branches/solr
> Currently all the tests pass with lucene trunk jars.
> I plopped in the flex jars and they do not, so I thought these might be 
> interesting to look at.


[jira] Assigned: (LUCENE-2325) investigate solr test failures using flex


 [ 
https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2325:
--

Assignee: Michael McCandless

> investigate solr test failures using flex
> -
>
> Key: LUCENE-2325
> URL: https://issues.apache.org/jira/browse/LUCENE-2325
> Project: Lucene - Java
>  Issue Type: Test
>Affects Versions: Flex Branch
>Reporter: Robert Muir
>Assignee: Michael McCandless
> Fix For: Flex Branch
>
> Attachments: LUCENE-2325.patch
>
>
> We have a branch of Solr located here: 
> https://svn.apache.org/repos/asf/lucene/solr/branches/solr
> Currently all the tests pass with lucene trunk jars.
> I plopped in the flex jars and they do not, so I thought these might be 
> interesting to look at.


Re: How can I use QueryScorer() to find only perfect matches??

2010-03-15 Thread Erick Erickson

Try +contents:term +contents:query. By misplacing the
'+' you're getting the default OR operator and the '+'
is probably being thrown away by the analyzer.
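
To show the difference between the two forms, here is a toy model. It is NOT 
the Lucene API: BooleanToy and its methods are invented, and a document is 
just the set of its lowercased, whitespace-separated terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only (not the Lucene API): a document is the set of its terms.
final class BooleanToy {
    private BooleanToy() {}

    /** "+a +b": every term is required (MUST). */
    static boolean matchesAllRequired(Set<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    /** "a b" with the default OR operator: any one term is enough (SHOULD). */
    static boolean matchesAnyOptional(Set<String> doc, List<String> terms) {
        for (String term : terms) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** Splits text on whitespace into a lowercased term set. */
    static Set<String> doc(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }
}
```

A document containing only "term" passes the OR form but not the required 
form, which is the kind of extra hit you are seeing.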

Luke will help here a lot.

HTH
Erick

On Mon, Mar 15, 2010 at 9:46 AM, christian stadler  wrote:

> Hi there,
>
> I have an issue with the QueryScorer(query) method at the moment and I need
> some assistance.
> I was indexing my e-book "lucene in action" and based on this index-db I
> started to play around with some boolean queries like:
> (contents:+term contents:+query)
> As a result I'm expecting as a perfect match for the phrase "term query"
> four hits.
>
> But when I run my sample to highlight this phrase in the context then I get
> a lot more results. It also finds all the matches for "term" and "query"
> independently.
>
> I think the problem is the QueryScorer() which softens the former exact
> boolean query.
> Then I was trying the following:
> private static Highlighter GetHits(Query query, Formatter formatter)
> {
>     string field = "contents";
>     BooleanQuery termsQuery = new BooleanQuery();
>
>     WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
>     foreach (WeightedTerm term in terms)
>     {
>         TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
>         termsQuery.Add(termQuery, BooleanClause.Occur.MUST);
>     }
>
>     // create query scorer based on term queries (field specific)
>     QueryScorer scorer = new QueryScorer(termsQuery);
>
>     Highlighter highlighter = new Highlighter(formatter, scorer);
>     highlighter.SetTextFragmenter(new SimpleFragmenter(20));
>
>     return highlighter;
> }
> to rewrite the query and set the term attribute from SHOULD to MUST
>
> But the result was the same.
> Do you have any example how I can use the QueryScorer() in exactly the same
> way as to mimic a BooleanSearch??
>
> thanks in advance
> Christian

[jira] Resolved: (LUCENE-2293) IndexWriter has hard limit on max concurrency


 [ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2293.


Resolution: Fixed

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.


[jira] Updated: (LUCENE-2325) investigate solr test failures using flex

2010-03-15 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2325:


Attachment: LUCENE-2325.patch

attached is a very small patch to the Solr branch so it will compile against 
flex jars.

> investigate solr test failures using flex
> -
>
> Key: LUCENE-2325
> URL: https://issues.apache.org/jira/browse/LUCENE-2325
> Project: Lucene - Java
>  Issue Type: Test
>Affects Versions: Flex Branch
>Reporter: Robert Muir
> Fix For: Flex Branch
>
> Attachments: LUCENE-2325.patch
>
>
> We have a branch of Solr located here: 
> https://svn.apache.org/repos/asf/lucene/solr/branches/solr
> Currently all the tests pass with lucene trunk jars.
> I plopped in the flex jars and they do not, so I thought these might be 
> interesting to look at.


[jira] Created: (LUCENE-2325) investigate solr test failures using flex

2010-03-15 Thread Robert Muir (JIRA)

investigate solr test failures using flex
-

 Key: LUCENE-2325
 URL: https://issues.apache.org/jira/browse/LUCENE-2325
 Project: Lucene - Java
  Issue Type: Test
Affects Versions: Flex Branch
Reporter: Robert Muir
 Fix For: Flex Branch
 Attachments: LUCENE-2325.patch

We have a branch of Solr located here: 
https://svn.apache.org/repos/asf/lucene/solr/branches/solr

Currently all the tests pass with lucene trunk jars.

I plopped in the flex jars and they do not, so I thought these might be 
interesting to look at.



How can I use QueryScorer() to find only perfect matches??

2010-03-15 Thread christian stadler

Hi there,

I have an issue with the QueryScorer(query) method at the moment and I need
some assistance.
I was indexing my e-book "lucene in action" and based on this index-db I 
started to play around with some boolean queries like:
(contents:+term contents:+query)
As a result I'm expecting as a perfect match for the phrase "term query" four
hits.

But when I run my sample to highlight this phrase in the context then I get a
lot more results. It also finds all the matches for "term" and "query"
independently.

I think the problem is the QueryScorer() which softens the former exact boolean
query.
Then I was trying the following:
private static Highlighter GetHits(Query query, Formatter formatter)
{
    string field = "contents";
    BooleanQuery termsQuery = new BooleanQuery();

    // extract the terms of the original query for this field
    WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
    foreach (WeightedTerm term in terms)
    {
        // require every extracted term (MUST instead of SHOULD)
        TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
        termsQuery.Add(termQuery, BooleanClause.Occur.MUST);
    }

    // create query scorer based on term queries (field specific)
    QueryScorer scorer = new QueryScorer(termsQuery);

    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(20));

    return highlighter;
}
to rewrite the query and set the term attribute from SHOULD to MUST

But the result was the same.
Do you have any example how I can use the QueryScorer() in exactly the same way
as to mimic a BooleanSearch??

thanks in advance
Christian





Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-15 Thread Robert Muir

>>> But I don't like baking in search concepts at index time...
>>
> Many scoring models are possible if you store enough stats in the
> index.
>

in general the missing stats seem to fit in two buckets/categories:

1) length normalization pivot: average length in bytes, terms, unique terms
2) term frequency normalization factor: max or average tf for the field.

you never need more than one of each category for the same field. one
approach would be for the search-time similarity to simply use these
generic names (i guess they could get some placeholder value if they
are not available) and at index time, you make sure you put the one
you want (or none at all) in the "bucket"
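
a rough sketch of what i mean, with everything hypothetical: FieldStats and 
GenericSimilarity are made-up names, the placeholder value of 1.0 is an 
assumption, and the pivoted length normalization and tf scaling are only 
examples of formulas that could consume the two slots.

```java
import java.util.OptionalDouble;

// Hypothetical container (not Lucene code): at most one statistic per
// category, chosen at index time.
final class FieldStats {
    final OptionalDouble lengthPivot; // e.g. average length in bytes, terms or unique terms
    final OptionalDouble tfNorm;      // e.g. max or average tf for the field

    FieldStats(OptionalDouble lengthPivot, OptionalDouble tfNorm) {
        this.lengthPivot = lengthPivot;
        this.tfNorm = tfNorm;
    }
}

// Search-time side: only knows the generic slot names, not which concrete
// statistic was stored in them.
final class GenericSimilarity {
    private static final double PLACEHOLDER = 1.0; // used when a slot is empty
    private final double slope;

    GenericSimilarity(double slope) {
        this.slope = slope;
    }

    /** Pivoted length normalization: 1 / ((1 - slope) + slope * length / pivot). */
    double lengthNorm(FieldStats stats, double docLength) {
        double pivot = stats.lengthPivot.orElse(PLACEHOLDER);
        return 1.0 / ((1.0 - slope) + slope * (docLength / pivot));
    }

    /** Scales the raw term frequency by the stored normalization factor. */
    double tf(FieldStats stats, double rawTf) {
        return rawTf / stats.tfNorm.orElse(PLACEHOLDER);
    }
}
```

the point is only that the similarity never asks which statistic it got, so 
the choice stays an index-time decision.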


-- 
Robert Muir
rcm...@gmail.com


Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-15 Thread Grant Ingersoll

On Mar 14, 2010, at 6:47 PM, Mark Miller wrote:

> 
> 
> On 03/14/2010 06:37 PM, Grant Ingersoll wrote:
>> On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote:
>> 
>>   
>>> This time a +1 without discuss :-)
>>> 
>> Yeah, but Uwe, the thread was DISCUSS, not VOTE!  :-)
>>   
> 
> I had a whole spiel about earning merit, and some contrib committers were 
> made contrib committers for just a single contrib, some long ago, didn't have 
> to necessarily show they understood/followed the apache way, lower bar (not 
> necessarily from talent perspective, but you might be made a contrib 
> committer just to maintain the code module you contributed, whether you 
> worked with the community or not), etc, etc. But ah, since everyone is into 
> it without discussion, far be it from me to stand against. And I got my spiel 
> in (super condensed) anyway now. With everyone else into it so far, I just 
> look foolish trying to discuss :)

Right, Mark.  I think we would be effectively raising the bar to some extent 
for what it takes to be a committer.  We'd also be making contrib a first class 
citizen (not that it ever wasn't, but some people have that perception).  
Finally, I think we need to recognize that not everyone needs to be a 
McCandless in order to contribute in a helpful way.  I think sometimes we 
forget that you can do svn revert.  Obviously, we don't want to have to do it 
often, but it's not a huge deal if it happens.  We've all been there.

-Grant

Re: [DISCUSS] Do away with Contrib Committers and make core committers

2010-03-15 Thread Grant Ingersoll


On Mar 14, 2010, at 8:25 PM, Yonik Seeley wrote:

> On Sun, Mar 14, 2010 at 5:47 PM, Mark Miller wrote:
>> On 03/14/2010 06:37 PM, Grant Ingersoll wrote:
>>> 
>>> On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote:
>>> 
>>> 
 
>>>> This time a +1 without discuss :-)
 
>>> 
>>> Yeah, but Uwe, the thread was DISCUSS, not VOTE!  :-)
>>> 
>> 
>> I had a whole spiel about earning merit, and some contrib committers were
>> made contrib committers for just a single contrib, some long ago, didn't
>> have to necessarily show they understood/followed the apache way, lower bar
>> (not necessarily from talent perspective, but you might be made a contrib
>> committer just to maintain the code module you contributed, whether you
>> worked with the community or not), etc, etc.
> 
> Hmmm, yeah - when it is time to VOTE, there are actually two different
> questions here:

Agreed.

> 1) if lucene should move away from contrib committers, adding no new ones

Yes, this is what I'm thinking.  All future committers would be based on 
contributions to the project and there would be no distinction between 
contrib/core.

> 2) if all existing contrib committers should immediately become core
> lucene/solr committers, or if that promotion should proceed in the
> normal fashion as it has in the past.

I'm fine w/ all of them, except we might want to check to see if it has been 
more than a year of contributing and ask any of them if they want to be 
Emeritus.

-Grant

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-15 Thread Michael McCandless

On Mon, Mar 15, 2010 at 12:03 AM, Marvin Humphrey wrote:
> On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote:
>
>> I still don't think similarity should have any bearing during indexing.
>
> Similarity has always, from day one, affected the contents of the index.  This
> idea that it should be totally divorced from indexing is, in fact, a very
> significant change that you are proposing for Lucene, and it will require
> non-trivial changes to the file format.

I agree.  Instead of storing byte per doc I'm proposing storing the
raw stats and letting Sim compute that byte at search time.  We can
also allow that Sim to cache stuff (boost bytes, if it uses them) to
make startup faster, eventually.
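
A rough sketch of that idea, assuming hypothetical names throughout
(SearchTimeNorms, LengthNorm, and the linear one-byte encoding are
illustrative, not Lucene's Similarity or encodeNorm): the index keeps the raw
per-document field length, and the search-time Sim derives its byte cache
from it when a reader is opened.

```java
// Hypothetical sketch: raw stats live in the index, bytes are derived later.
final class SearchTimeNorms {
    /** Hypothetical search-time scoring hook; not Lucene's Similarity class. */
    interface LengthNorm {
        float lengthNorm(int numTerms);
    }

    /** Quantizes a norm in [0, 1] to one byte (illustrative encoding only). */
    static byte encode(float norm) {
        float clamped = Math.max(0f, Math.min(1f, norm));
        return (byte) Math.round(clamped * 255f);
    }

    /** Inverse of encode, up to quantization error. */
    static float decode(byte b) {
        return (b & 0xFF) / 255f;
    }

    /** Builds the per-document byte cache from raw lengths stored in the index. */
    static byte[] buildCache(int[] rawFieldLengths, LengthNorm sim) {
        byte[] cache = new byte[rawFieldLengths.length];
        for (int doc = 0; doc < rawFieldLengths.length; doc++) {
            cache[doc] = encode(sim.lengthNorm(rawFieldLengths[doc]));
        }
        return cache;
    }
}
```

Calling buildCache twice with different LengthNorm implementations over the
same raw lengths gives two different byte arrays without reindexing; the
in-memory cost stays at one byte per document either way.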

> For starters, you're going to at least double the footprint of the norms.  For
> fields with more than 127 tokens or 127 unique terms, the increase will be
> greater... and if the user sets doc-boost and field-boost in a pattern that
> defies RLE compression, the footprint will be greater still.

On disk, yes.  In memory, no (assuming your Sim impl encodes boost as byte).

> I happen to think that limited search-time settability of Similarity offers a
> nice feature -- the ability to futz with different weighting models and length
> normalization settings without reindexing -- and that it's worth exploring in
> pursuit of this feature.
>
> But by opting to forego the lossy compression now performed by encodeNorm() at
> index-time and store precursor statistics instead, we are going to take a hit
> on index size even with lossless compression.

I think it's worth letting the custom Sim cache stuff [privately] on
disk, ie the byte norms, eventually.

> Furthermore, delaying Similarity choice means that it becomes the user's
> responsibility to ensure that index-time Codec choice is compatible with
> search-time Similarity choice.  In contrast, setting Similarity at index-time
> means that the core gets to pick the Codec and can ensure that all the
> necessary data gets encoded, sparing the user from having to understand the
> gory details of posting formats.

Yeah this is the part I struggle with -- how to make index-time field
options "intelligible".  But I think good defaulting does 90% of the
work.  The remaining 10% can work backwards from their search needs to
what must be done at indexing.

> In summary, I think search-time setting of Similarity is a nice feature but a
> poor requirement.  I'm not persuaded that this proposal to banish Similarity
> from index-time is wise.

OK I think we just differ...

>> But I don't like baking in search concepts at index time...
>
> Then you ought to use a traditional RDBMS rather than an indexing engine, and
> make sure you don't put indexes on any of the fields in your tables.  :)
>
> Or maybe an RDBMS has too many search concepts baked in, and a flat file would
> be best.  :)
>
> Seriously... optimizing on-disk data structures to accommodate anticipated
> search query patterns and maximize speed and relevance... that's what
> indexing's all about, ain't it?

You're over-reading into what I said.

I mean specifically one should not have to commit to the precise
scoring model they will use for a given field, when they index that
field.

Many scoring models are possible if you store enough stats in the
index.

> And what class other than Similarity knows enough about the scoring algorithm
> to perform these data reduction tasks?  If it's not going to be Similarity
> itself, it has to be something that knows absolutely everything about the
> Similarity implementation's scoring model.

I don't follow this...

It will be Sim that computes the norm bytes.

I mean, other classes can go and look @ these stats if they want,
too... users will come up with neat uses over time :)

>> > Right.  However, now that I've thought about it, if a user indicates that a
>> > field is "match-only" by supplying a MatchSimilarity, we know that we can
>> > omit boost bytes.
>> >
>> > So we can re-conceive "MatchSimilarity" as being analogous to omitNorms.
>> > Huzzah!
>> >
>> > One down, one to go.  :)
>>
>> Hmm except shouldn't you allow omitting boost bytes but keeping term
>> freqs?  Ie all docs are roughly the same length (say, a title field)
>> and I never boost them?  How will you allow this?
>
> I think that you've described an uncommon use case, and it's tempting to just
> wave it off with the easy answer: you spec a Sim that writes such a format.

I don't think this is so uncommon?  (This is the omitNorms case in
Lucene today, except you still gotta index positions, until we decouple
the two = LUCENE-2048.  Such a nice round binary number for
remembering...).

> But here's where maybe Lucy can steal from the Lucene flex branch.

Yay: poaching!

> We can give Similarity a makePostingCodec() factory method.  Then,
> we can publish common PostingCodecs as public classes, allowing us
> to support different formats with minimal effort.
>
>  class MySim extends Similarity {
>

[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency


[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845263#action_12845263
 ] 

Michael McCandless commented on LUCENE-2293:


bq.  For example, currently a nice optimization would be to store the first 
posting in the PostingList object and only allocate slices once you see the 
second occurrence (similar to the pulsing codec)?

I think we can do even better, ie, that class wastes RAM for the single posting 
case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are 
not needed).

EG we could have a separate class dedicated to the singleton case.  When term 
is first encountered it's enrolled there.  We'd probably need a separate hash 
to store these (though not necessarily?).  If it's seen again it's switched to 
the full posting.

bq. What exactly do you mean with parallel arrays? Parallel to the termHash 
array? Then the termsHash array would not be an array of PostingList objects 
anymore, but an array of pointers into the char[] array? And you'd have e.g. a 
parallel int[] array for df, another int[] for pointers into the postings byte 
pool, etc? Something like that?

I mean instead of allocating an instance per unique term, we assign an integer 
ID (dense, ie, 0, 1, 2...).

And then we have an array for each member now in 
FreqProxTermsWriter.PostingList, ie int[] docFreqs, int[] lastDocIDs, etc.  
Then to look up say the lastDocID for a given postingID you just get 
lastDocIDs[postingID].  If we're worried about oversize allocation overhead, we 
can make these arrays paged... but that'd slow down each access.
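
A small sketch of the parallel-array layout, assuming a hypothetical
ParallelPostings class (only the array names docFreqs and lastDocIDs come
from the comment above; the growth policy and the meaning given to docFreqs
here, the number of distinct documents seen, are assumptions for
illustration). Each unique term gets a dense int ID and its per-term state
lives at that index in every array, so no object is allocated per term.

```java
import java.util.Arrays;

// Hypothetical sketch of per-term state kept in parallel arrays
// instead of one PostingList object per unique term.
final class ParallelPostings {
    private int[] docFreqs = new int[4];
    private int[] lastDocIDs = new int[4];
    private int count; // next dense postingID to hand out

    /** Registers a new unique term and returns its dense ID (0, 1, 2, ...). */
    int newPosting(int firstDocID) {
        if (count == docFreqs.length) {
            // Grow all arrays together so they stay parallel.
            int newSize = count * 2;
            docFreqs = Arrays.copyOf(docFreqs, newSize);
            lastDocIDs = Arrays.copyOf(lastDocIDs, newSize);
        }
        docFreqs[count] = 1;
        lastDocIDs[count] = firstDocID;
        return count++;
    }

    /** Records another occurrence of the term; counts each document once. */
    void addOccurrence(int postingID, int docID) {
        if (docID != lastDocIDs[postingID]) {
            docFreqs[postingID]++;
            lastDocIDs[postingID] = docID;
        }
    }

    int docFreq(int postingID) { return docFreqs[postingID]; }
    int lastDocID(int postingID) { return lastDocIDs[postingID]; }
}
```

Doubling on growth is the "oversize allocation" trade-off mentioned above;
a paged variant would replace each int[] with an int[][] at the cost of an
extra indirection per access.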

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
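
The per-thread-state proposal in the description can be pictured with a toy
model, assuming a hypothetical ThreadStateBuffer class (documents are plain
strings and a "segment" is just a list; none of this is DocumentsWriter
code). Each state assigns docIDs from its own private stream and flushes on
its own schedule, so one full buffer never forces the others to stop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of one thread state with a private docID stream.
// Not thread-safe by design: each state is used by one thread at a time.
final class ThreadStateBuffer {
    private final int maxBufferedDocs;
    private List<String> buffered = new ArrayList<>();
    private final List<List<String>> flushedSegments = new ArrayList<>();

    ThreadStateBuffer(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers a document and returns its segment-private docID. */
    int addDocument(String doc) {
        int docID = buffered.size(); // private stream restarts at 0 per segment
        buffered.add(doc);
        if (buffered.size() >= maxBufferedDocs) {
            flush(); // only this state flushes; others keep indexing
        }
        return docID;
    }

    /** Turns the current buffer into its own segment, with no merge step. */
    void flush() {
        if (!buffered.isEmpty()) {
            flushedSegments.add(buffered);
            buffered = new ArrayList<>();
        }
    }

    int flushedSegmentCount() { return flushedSegments.size(); }
}
```

With two states of different sizes, the smaller one flushes a segment while
the larger one keeps buffering; merging the resulting segments is left to
the normal background merge machinery.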

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments


[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845261#action_12845261
 ] 

Michael McCandless commented on LUCENE-2324:


Sounds great -- let's test it in practice.

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer


[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845257#action_12845257
 ] 

Michael McCandless commented on LUCENE-2312:


{quote}
I got the basics of the term enum working, it can be completed
fairly easily. So I moved on to term docs... There we got some
work to do? Because we're not storing the skip lists in the ram
buffer, currently. I guess we'll need a new
FreqProxTermsWriterPerField that stores the skip lists as
they're being written? How will that work? Doesn't the
multi-level skip list assume a set number of docs?
{quote}

Sounds like you & Michael should sync up!

Good question on skipping -- for first cut we can have no skipping
(and just scan)?  Skipping may not be that important in practice,
unless RAM buffer becomes truly immense.  Of course, the tinier the
docs the more important skipping will be...
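
For the "no skipping, just scan" first cut, a sketch along these lines may
help, assuming a hypothetical ScanningDocs class over a sorted int[] of
docIDs (the real RAM buffer stores encoded postings in byte slices, so this
is a simplification). The limit parameter stands in for however many
postings have been published to the reader.

```java
// Hypothetical sketch: advance() implemented as a plain linear scan,
// with no skip list. Names are illustrative, not Lucene classes.
final class ScanningDocs {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIDs; // sorted ascending, as buffered in RAM
    private final int limit;    // number of entries this reader may see
    private int pos = -1;

    ScanningDocs(int[] docIDs, int limit) {
        if (limit < 0 || limit > docIDs.length) {
            throw new IllegalArgumentException("limit out of range");
        }
        this.docIDs = docIDs;
        this.limit = limit;
    }

    /** Moves to the next visible document, or NO_MORE_DOCS when exhausted. */
    int nextDoc() {
        pos++;
        return pos < limit ? docIDs[pos] : NO_MORE_DOCS;
    }

    /** Scans forward to the first document >= target; O(n) per call. */
    int advance(int target) {
        int doc;
        do {
            doc = nextDoc();
        } while (doc < target);
        return doc;
    }
}
```

The cost of advance() grows with the number of buffered postings per term,
which is why many tiny documents make skipping matter sooner.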


> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer