Re: Realtime Search
Jason Rutherglen jason.rutherg...@gmail.com wrote:

>> We'd also need to ensure when a merge kicks off, the SegmentReaders used by the merging are not newly reopened but also borrowed from
>
> The IW merge code currently opens the SegmentReader with a 4096 buffer size (different than the 1024 default); how will this case be handled?

I think we'd just use 1024 when merging.

>> reopen would then flush any added docs to new segments
>
> IR.reopen would call IW.flush?

I think it has to? (Whether it is IR.reopen, or a class that sits on top of both IR & IW, I'm not sure.) Ie, the interface would be: you add/delete/updateDoc, setNorm a bunch of times, during which none of these changes are visible to your currently open reader, followed by reopen to get a reader that then sees those changes? (This is all still brainstorming at this point, of course.)

>> When IW.commit is called, it also then asks each SegmentReader to commit. Ie, IR.commit would not be used.
>
> Why is this? SegmentReader.commitChanges would be called instead?

Because IR.commit is doing other stuff (invoking the deletion policy, syncing newly referenced files, writing the new segments file, rollback logic on hitting an exception, etc.) that overlaps what IW.commit also does. It'd be great to factor this common stuff out so IW and IR would share a single source. (Yes, SR.commitChanges would be called directly, I think.)

>> Then when reopen is called, we must internally reopen that clone() such that its deleted docs are carried over to the newly reopened reader and newly flushed docs from IW are visible as new SegmentReaders.
>
> If deletes are made to the external reader (meaning the one obtained by IW.getReader), then deletes are made via IW.deleteDocument, then reopen is called, what happens in this case? We will need to merge the del docs from the internal clone into the newly reopened reader?

I guess we could merge them. Ie, deletes made through the reader (by docID) are immediately visible, but through the writer are buffered until a flush or reopen? Still, I don't like exposing two ways to do deletions, with two different behaviours (buffered or not). It's weird. Maybe, instead, all deletes done via IW would be immediate? It seems like either 1) all deletes are buffered until reopen, or 2) all deletes are immediately materialized. I think half/half is too strange.

>> the IR becomes transactional as well -- deletes are not visible immediately until reopen is called
>
> Interesting. I'd rather somehow merge the IW and external reader's deletes, otherwise it seems like we're radically changing how IR works. Perhaps the IW keeps a copy of the external IR that has the write lock (thinking of IR.clone, where the write lock is passed onto the latest clone). This way IW.getReader is about the same as reopen/clone (because it will call reopen on presumably the latest IR).

We'd only be radically changing how the RealTimeReader works. I think the initial approach here might be to simply open up enough package-private APIs or subclass-ability on IR and IW so that we can experiment with these realtime ideas. Then we iterate w/ different experiments to see how things flesh out...

Actually, could you redo LUCENE-1516 now that LUCENE-1314 is in?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
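The option (1) semantics discussed above -- all adds and deletes buffered inside the writer until reopen materializes them into a new point-in-time reader -- can be sketched with a toy model. Class and method names here are made up for illustration; this is not Lucene's API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model: adds and deletes are buffered in the writer and become
// visible only when reopen() hands back a new point-in-time reader.
public class ToyRealtimeIndex {
    private final List<String> committedDocs = new ArrayList<>();
    private final Set<Integer> committedDeletes = new HashSet<>();
    private final List<String> bufferedDocs = new ArrayList<>();
    private final Set<Integer> bufferedDeletes = new HashSet<>();

    public void addDocument(String doc) { bufferedDocs.add(doc); }
    public void deleteDocument(int docId) { bufferedDeletes.add(docId); }

    /** Point-in-time snapshot; unaffected by later writer changes. */
    public static class Reader {
        private final List<String> docs;
        private final Set<Integer> deletes;
        Reader(List<String> docs, Set<Integer> deletes) {
            this.docs = new ArrayList<>(docs);
            this.deletes = new HashSet<>(deletes);
        }
        public int numDocs() { return docs.size() - deletes.size(); }
    }

    /** Materialize buffered adds/deletes, then return a fresh reader. */
    public Reader reopen() {
        committedDocs.addAll(bufferedDocs);
        committedDeletes.addAll(bufferedDeletes);
        bufferedDocs.clear();
        bufferedDeletes.clear();
        return new Reader(committedDocs, committedDeletes);
    }

    public static void main(String[] args) {
        ToyRealtimeIndex idx = new ToyRealtimeIndex();
        idx.addDocument("doc0");
        idx.addDocument("doc1");
        Reader r1 = idx.reopen();  // sees both adds
        idx.deleteDocument(0);     // buffered: r1 still sees 2 docs
        Reader r2 = idx.reopen();  // materializes the delete
        System.out.println(r1.numDocs() + " " + r2.numDocs()); // 2 1
    }
}
```

The alternative, option (2), would apply the delete to the committed state immediately, making it visible to readers without waiting for reopen.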
Re: Realtime Search
> deletes made through reader (by docID) are immediately visible, but through writer are buffered until a flush or reopen?

This is what I was thinking: IW buffers deletes, IR does not. Making IW deletes visible immediately, by applying them to the IR, makes sense as well. What should be the behavior of IW.updateDocument?

LUCENE-1314 is in, and we've agreed IR.reopen causes an IW.flush, so I'll continue the LUCENE-1516 patch.
Re: Realtime Search
> We'd also need to ensure when a merge kicks off, the SegmentReaders used by the merging are not newly reopened but also borrowed from

The IW merge code currently opens the SegmentReader with a 4096 buffer size (different than the 1024 default); how will this case be handled?

> reopen would then flush any added docs to new segments

IR.reopen would call IW.flush?

> When IW.commit is called, it also then asks each SegmentReader to commit. Ie, IR.commit would not be used.

Why is this? SegmentReader.commitChanges would be called instead?

> Then when reopen is called, we must internally reopen that clone() such that its deleted docs are carried over to the newly reopened reader and newly flushed docs from IW are visible as new SegmentReaders.

If deletes are made to the external reader (meaning the one obtained by IW.getReader), then deletes are made via IW.deleteDocument, then reopen is called, what happens in this case? We will need to merge the del docs from the internal clone into the newly reopened reader?

> the IR becomes transactional as well -- deletes are not visible immediately until reopen is called

Interesting. I'd rather somehow merge the IW and external reader's deletes, otherwise it seems like we're radically changing how IR works. Perhaps the IW keeps a copy of the external IR that has the write lock (thinking of IR.clone, where the write lock is passed onto the latest clone). This way IW.getReader is about the same as reopen/clone (because it will call reopen on presumably the latest IR).
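The shared-write-lock idea threaded through this exchange can be modeled in miniature: the writer takes the index's single write lock, the reader it hands out piggybacks on that same lock, and an independent second writer is refused. All names here are hypothetical; Lucene's real lock is a file-based Directory lock, approximated below with an in-memory flag:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy sketch: the joint IR/IW pair shares one write lock, so both may
// mutate the index, while a second independent writer is locked out.
public class SharedWriteLockDemo {
    static class Index { final AtomicBoolean writeLock = new AtomicBoolean(false); }

    static class Writer {
        final Index index;
        Writer(Index index) {
            this.index = index;
            // Non-reentrant acquire, like Lucene's write.lock file.
            if (!index.writeLock.compareAndSet(false, true))
                throw new IllegalStateException("index locked by another writer");
        }
        /** Reader handed out by this writer piggybacks on the same lock. */
        Reader getReader() { return new Reader(index); }
        void close() { index.writeLock.set(false); }
    }

    static class Reader {
        final Index index;
        Reader(Index index) { this.index = index; }
        void deleteDocument(int docId) {
            // Allowed: the owning writer already holds the shared lock.
            if (!index.writeLock.get())
                throw new IllegalStateException("no write lock held");
        }
    }

    public static boolean secondWriterBlocked() {
        Index index = new Index();
        Writer w1 = new Writer(index);
        w1.getReader().deleteDocument(0); // fine: shares w1's lock
        boolean blocked;
        try { new Writer(index); blocked = false; }
        catch (IllegalStateException e) { blocked = true; }
        w1.close();
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println(secondWriterBlocked()); // true
    }
}
```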
Re: Realtime Search
Jason Rutherglen wrote:

>> But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure...
>
> Can't IW use the IR to do its deletions? Currently, deletions in IW are implemented in DocumentsWriter.applyDeletes by loading a segment with SegmentReader.get() and making the deletions, which causes term-index load overhead per flush. If IW has an internal IR then the deletion process can use it (not SegmentReader.get) and there should not be a conflict anymore between the IR and IW deletion processes.

Today, IW quickly opens each SegmentReader, applies deletes, then commits & closes it, because we have considered it too costly to leave these readers open. But if you've opened a persistent IR via the IndexWriter anyway, we should use the SegmentReaders from that IR instead.

It seems like the joint IR+IW would allow you to do adds, deletes, setNorms, all of which are not visible in the exposed IR until IR.reopen is called. reopen would then flush any added docs to new segments, materialize any buffered deletes into the BitVectors (or a future transactional sorted-int-tree thingy), likewise for norms, and then return a new IR. Ie, the IR becomes transactional as well -- deletes are not visible immediately until reopen is called (unlike today, when you delete via IR).

I think this means, internally, when IW wants to make changes to the shared IR, it should make a clone() and do the changes privately to that instance. Then when reopen is called, we must internally reopen that clone() such that its deleted docs are carried over to the newly reopened reader, and newly flushed docs from IW are visible as new SegmentReaders. And on reopen, the deletes should not be flushed to the Directory -- they only need to be moved into each SegmentReader's deletedDocs.

We'd also need to ensure, when a merge kicks off, that the SegmentReaders used by the merging are not newly reopened but also borrowed from the already open IR. This could actually mean that some deleted docs get merged away before the deletions ever get flushed to the Directory.

> we may have to block deletions via IW

Hopefully they can be buffered.

> Where else does the write lock need to be coordinated between IR and IW?
>
>> somehow IW & IR have to split the write lock else we may need to merge deletions somehow.
>
> This is a part I'd like to settle on before the start of implementation. It looks like in IW deletes are buffered as terms or queries until flushed. I don't think there needs to be a lock until the flush is performed? For the merge changes to the index, the deletion policy can be used to ensure a reader still has access to the segments it needs from the main directory. The write lock is held to prevent multiple writers from buffering and then writing changes to the index.

Since we will have this joint IR/IW share state, as long as we properly synchronize/share things between IR/IW, it's fine if they both share the write lock. It seems like IR.reopen suddenly means "have IW materialize all pending stuff and give me a new reader", where "stuff" is adds & deletes. Adds must materialize via the directory. Deletes can materialize entirely in RAM. Likewise for norms. When IW.commit is called, it also then asks each SegmentReader to commit. Ie, IR.commit would not be used.

>> We have to test performance to measure the net add -> search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach.
>
> It is true that this is the best way to start, and in fact may be good enough for many users. It could help new users to expose a reader from IW, so the delineation between them is removed and Lucene becomes easier to use. At the very least, this system allows concurrently updateable IR and IW due to sharing the write lock, something that is currently incorrect in Lucene.

I wouldn't call it incorrect. It was an explicit design tradeoff to make the division between IR & IW, and done for many good reasons. We are now talking about relaxing that, and it clearly raises a number of challenging issues...

>> Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides?
>
> What we have described above (exposing IR via IW) will be sufficient, and realtime will live above it.

OK, good. In this model, the combined IR+IW is still jointly transactional, in that the IW's commit() method still behaves as it does today. It's just that the IR that's linked to the IW is allowed to see changes, shared only in RAM, that a freshly opened IR on the index would not see until commit has been called.
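The applyDeletes change Jason describes in this exchange -- reusing the persistent reader's SegmentReaders instead of opening a fresh one per flush -- amounts to keeping a pool keyed by segment name. A toy sketch with made-up stand-in classes:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: reuse pooled per-segment readers when the writer applies
// buffered deletes, instead of opening a fresh SegmentReader per flush.
public class ReaderPoolDemo {
    static class SegReader {
        static int opens = 0;                 // counts costly opens
        final Set<Integer> deletedDocs = new HashSet<>();
        SegReader() { opens++; }              // stands in for SegmentReader.get()
    }

    // Pool shared by the writer and the externally exposed reader.
    static final Map<String, SegReader> pool = new HashMap<>();

    static SegReader get(String segment) {
        return pool.computeIfAbsent(segment, s -> new SegReader());
    }

    /** Writer-side: materialize a buffered delete into the pooled reader. */
    static void applyDelete(String segment, int docId) {
        get(segment).deletedDocs.add(docId);  // no reopen, no commit/close
    }

    public static void main(String[] args) {
        applyDelete("_0", 3);
        applyDelete("_0", 7);   // reuses the pooled reader: still one open
        applyDelete("_1", 2);
        System.out.println(SegReader.opens);              // 2
        System.out.println(get("_0").deletedDocs.size()); // 2
    }
}
```

Because the pool is shared, deletes applied by the writer land directly in the same deletedDocs the exposed reader will pick up on reopen, with no per-flush term-index load.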
Re: Realtime Search
>> Patch #2: Implement a realtime ram index class
>
> I think this one is optional, or, rather, an optimization that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory.

John, Zoie could be of use for this patch. In addition, we may want to implement flushing the IW ram buffer to a RAMDir for reading, as M.M. suggested. First, though, the IW to IR integration (LUCENE-1516) needs to be implemented; otherwise it's not possible to properly execute updates in realtime.
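The RAMDir idea can be pictured as a two-tier directory: freshly flushed little segments land in RAM and are searchable immediately, while a background step later migrates them to the main Directory without changing what readers see. A toy sketch (illustrative names; plain maps stand in for Lucene Directory implementations):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of a two-tier directory: flush to RAM first, migrate to
// the main (disk) directory later; a reader sees the union of both.
public class TwoTierDirectoryDemo {
    static final Map<String, byte[]> ramDir = new LinkedHashMap<>();
    static final Map<String, byte[]> mainDir = new LinkedHashMap<>();

    /** Flush lands in RAM: cheap, immediately searchable. */
    static void flushSegment(String name, byte[] data) { ramDir.put(name, data); }

    /** Background migration to the main Directory. */
    static void migrate(String name) {
        byte[] data = ramDir.remove(name);
        if (data != null) mainDir.put(name, data);
    }

    /** A reader searches the union of both tiers. */
    static Map<String, byte[]> visibleSegments() {
        Map<String, byte[]> all = new LinkedHashMap<>(mainDir);
        all.putAll(ramDir);
        return all;
    }

    public static void main(String[] args) {
        flushSegment("_0", new byte[] {1});
        System.out.println(visibleSegments().keySet()); // [_0]
        migrate("_0");                                  // moved to the disk tier
        System.out.println(visibleSegments().keySet()); // [_0] (still visible)
    }
}
```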
Re: Realtime Search
Grant,

Do you have a proposal in mind? It would help to suggest some classes and methods, to help understand an alternative to what is being discussed.

-J
Re: Realtime Search
Just thinking out loud... haven't looked at your patch yet (one of these days I will be back up for air).

My initial thought is that you would have a factory that produced both the Reader and the Writer as a pair, or was at least aware of what to go get from the Writer. Something like:

class IndexFactory {
  IndexWriter getWriter()
  IndexReader getReader()
  // Not sure if this is needed yet, but:
  IndexReader getReader(IndexWriter)
}

The factory (or whatever you want to call it) is responsible for making sure the Writer and Reader have the pieces they need, i.e. the SegmentInfos. The first getReader will get you the plain old Reader that everyone knows and loves today (assuming there is a benefit to keeping it around); the second one knows what to get off the Writer to create the appropriate Reader. It's nothing particularly hard to implement over what you are proposing, I don't think. Just trying to keep the Reader out of the Writer from an API cleanliness standpoint.

-Grant

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
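Grant's IndexFactory sketch could be fleshed out roughly as follows, with stub Reader/Writer types standing in for Lucene's. This is hypothetical code showing only the shape of the API: the writer-aware getReader shares the writer's in-memory SegmentInfos, while the plain getReader does not:

```java
// Toy fleshing-out of the IndexFactory idea; all classes are stubs.
public class IndexFactoryDemo {
    static class IndexWriter {
        final Object segmentInfos = new Object(); // in-memory segment state
    }
    static class IndexReader {
        final Object segmentInfos;
        IndexReader(Object infos) { this.segmentInfos = infos; }
    }

    /** Hands out readers and writers, wiring up the shared pieces. */
    static class IndexFactory {
        private final IndexWriter writer = new IndexWriter();

        IndexWriter getWriter() { return writer; }

        /** Plain reader: sees only committed state (stubbed here). */
        IndexReader getReader() { return new IndexReader(new Object()); }

        /** Realtime reader: built from the writer's own SegmentInfos. */
        IndexReader getReader(IndexWriter w) {
            return new IndexReader(w.segmentInfos);
        }
    }

    public static void main(String[] args) {
        IndexFactory f = new IndexFactory();
        IndexReader rt = f.getReader(f.getWriter());
        // The realtime reader shares the writer's uncommitted segment state.
        System.out.println(rt.segmentInfos == f.getWriter().segmentInfos); // true
    }
}
```

The design point is that neither Reader nor Writer references the other directly; only the factory knows how to wire the shared SegmentInfos.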
Re: Realtime Search
Jason Rutherglen wrote:

> Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock

I tentatively like this approach so far... That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments & deletions that have been flushed but not committed. It's allowed to do its own deletions & norms updating. When reopen() is called, it grabs the writer's SegmentInfos again.

> Patch #2: Implement a realtime ram index class

I think this one is optional, or, rather, an optimization that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory.

> Patch #3: Implement realtime transactions in IndexWriter or in a subclass of IndexWriter by implementing a createTransaction method that generates a realtime Transaction object. When the transaction is flushed, the transaction index modifications are available via the getReader method of IndexWriter

Can't this be layered on top? Or... are you looking to add support for multiple transactions in flight at once on IndexWriter?

Mike
Re: Realtime Search
Marvin Humphrey mar...@rectangular.com wrote:

> The goal is to improve worst-case write performance. ... In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index.
>
> * New docs have been added in new segments.
> * Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up.
> * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged.
> * Tombstones have been added which suppress documents in segments which have just been merged.
>
> Only the last category of deletions matters. At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers, and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock.

OK, now I understand KS's two-writer model. Lucene has already solved this with the ConcurrentMergeScheduler -- all segment merges are done in the BG (by default). We also have to compute the deletions against the new segment, to include deletions that happened to the merged segments after the merge kicked off.

Still, it's not a panacea, since often the IO system has horrible degradation in performance while a merge is running. If only we could mark all IO (reads & writes) associated with merging as low priority and have the OS actually do the right thing...

> It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though.
Right, we have a different problem in Lucene (because we must warm a reader before using it): after a large merge, warming the new IndexReader that includes that segment can be costly (though that cost is going down with LUCENE-1483, and eventually column-stride fields). But we can solve this by allowing a reopened reader to use the old segments, until the new segment is warmed.

Mike
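The warming idea here -- keep serving the old segments until the merged segment is warmed, then swap -- can be sketched as an atomic publish. Names are illustrative; real warming would preload norms, field caches, and so on:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy sketch: searches read a volatile reference that is swapped only
// after the new snapshot has been warmed, so they never pay the warming
// cost and never block on the swap.
public class WarmThenSwap {
    static class Snapshot {
        final String segments;
        boolean warmed;
        Snapshot(String segments) { this.segments = segments; }
        void warm() { warmed = true; } // e.g. preload norms, caches
    }

    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot("_0"));

    Snapshot acquire() { return current.get(); }

    /** Called after a merge finishes: warm privately, then publish. */
    void publish(Snapshot merged) {
        merged.warm();        // costly work happens off the search path
        current.set(merged);  // atomic swap; searches see it next acquire
    }

    public static void main(String[] args) {
        WarmThenSwap index = new WarmThenSwap();
        Snapshot before = index.acquire();
        index.publish(new Snapshot("_0+_1 merged"));
        Snapshot after = index.acquire();
        System.out.println(before.segments + " -> " + after.segments);
        System.out.println(after.warmed); // true
    }
}
```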
Re: Realtime Search
M.M.: That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments & deletions that have been flushed but not committed. It's allowed to do its own deletions & norms updating. When reopen() is called, it grabs the writer's SegmentInfos again.

Are you referring to the IW.pendingCommit SegmentInfos variable? When you say "flushed", are you referring to the IW.prepareCommit method?

I think step #1 is important and should be generally useful outside of realtime search; however, it's unclear how/when calls to IW.deleteDocument will be reflected in IW.getReader? I assumed that IW.commit would result in IW.deleteDocument changes showing up in IW.getReader. Calls to Transaction.deleteDocument/flush would show up immediately; otherwise the semantics of the realtime indexing vs. IW-based batch indexing use cases are generally unclear to the user. With IW indexing, one adds documents and deletes documents, then does a global commit to the main directory. Interleaving deletes with documents added isn't possible, because if the documents are in the IW ram buffer, they are not necessarily deleted. So it seems that if the semantics are such that IW.commit or IW.prepareCommit exposes deletes via IW.getReader, what is the difference compared to IndexReader.reopen on the index, except the shared write lock? Ok, perhaps this is all one gets, and as you mentioned, the rest is placed on a level above IW, which hopefully does not confuse the user.

M.M.: Patch #2: Implement a realtime ram index class -- I think this one is optional, or, rather, an optimization that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory.

If this is swapped in later, how is the system realtime, except perhaps for deletes?

M.M.: Can't this be layered on top? Or... are you looking to add support for multiple transactions in flight at once on IndexWriter?

The initial version can be layered on top; that will make testing easier. Adding support for multiple transactions at once on IndexWriter, outside of the realtime transactions, seems to require a lot of refactoring.
Re: Realtime Search
Jason Rutherglen jason.rutherg...@gmail.com wrote:

> Are you referring to the IW.pendingCommit SegmentInfos variable?

No, I'm referring to segmentInfos. (pendingCommit is the snapshot of segmentInfos taken when committing...)

> When you say flushed you are referring to the IW.prepareCommit method?

No, I'm referring to flush... it writes a new segment but not a new segments_N, does not sync the files, and does not invoke the deletion policy.

> I think step #1 is important and should be generally useful outside of realtime search, however it's unclear how/when calls to IW.deleteDocument will be reflected in IW.getReader?

You'd have to flush (to materialize pending deletions inside IW), then reopen the reader, to see any deletions done via the writer. But I think instead realtime search would do deletions via the reader (because if you use IW you're updating deletes through the Directory = too slow).

> Interleaving deletes with documents added isn't possible because if the documents are in the IW ram buffer, they are not necessarily deleted

Well, we buffer the delete and then on flush we materialize the delete. So if you add a doc with field X=77, then delete-by-term X:77, then flush, you'll flush a 1-document segment whose only document is marked as deleted.

But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact, if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open, we may have to block deletions via IW. Not sure... somehow IW & IR have to split the write lock, else we may need to merge deletions somehow.

> If this is swapped in later how is the system realtime except perhaps deletes?

We have to test performance to measure the net add -> search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach.
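Mike's X=77 example can be modeled in a few lines: a delete-by-term is buffered next to the RAM buffer and materialized against the newly flushed segment, yielding a one-document segment whose only document is deleted. This is toy code; the real IW also tracks which docs were added before each buffered delete, bookkeeping that is elided here:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of buffered deletes: delete-by-term is remembered alongside
// the RAM buffer and materialized on flush against the new segment.
public class BufferedDeleteDemo {
    static class Segment {
        final int numDocs;
        final Set<Integer> deletedDocs;
        Segment(int numDocs, Set<Integer> deletedDocs) {
            this.numDocs = numDocs;
            this.deletedDocs = deletedDocs;
        }
    }

    private final List<Map<String, String>> ramBuffer = new ArrayList<>();
    private final List<String[]> bufferedDeleteTerms = new ArrayList<>();

    void addDocument(Map<String, String> doc) { ramBuffer.add(doc); }
    void deleteByTerm(String field, String value) {
        bufferedDeleteTerms.add(new String[] {field, value});
    }

    /** Write buffered docs as a segment, then apply buffered deletes. */
    Segment flush() {
        Set<Integer> deleted = new HashSet<>();
        for (String[] term : bufferedDeleteTerms)
            for (int i = 0; i < ramBuffer.size(); i++)
                if (term[1].equals(ramBuffer.get(i).get(term[0])))
                    deleted.add(i);
        Segment seg = new Segment(ramBuffer.size(), deleted);
        ramBuffer.clear();
        bufferedDeleteTerms.clear();
        return seg;
    }

    public static void main(String[] args) {
        BufferedDeleteDemo w = new BufferedDeleteDemo();
        w.addDocument(Map.of("X", "77")); // add doc with field X=77
        w.deleteByTerm("X", "77");        // buffered delete-by-term
        Segment seg = w.flush();
        // A one-document segment whose only document is marked deleted:
        System.out.println(seg.numDocs + " " + seg.deletedDocs); // 1 [0]
    }
}
```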
Adding support for multiple transactions at once on IndexWriter outside of the realtime transactions seems to require a lot of refactoring. Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Jan 9, 2009, at 8:39 AM, Michael McCandless wrote: Jason Rutherglen wrote: Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock I tentatively like this approach so far... That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments and deletions that have been flushed but not committed. It's allowed to do its own deletions and norms updating. When reopen() is called, it grabs the writer's SegmentInfos again. Minor design nit... We've spent a lot of time up until now getting write functionality out of the Reader, and now we are going to add read functionality into the Writer? Is that the right thing to do? Perhaps there is an interface or some shared objects to be used/exposed, or maybe people should get Readers/Writers from a factory and you could have a RT Factory and a default Factory? Not trying to distract from the deeper issues here, but I don't think it makes sense to have the Writer coupled to the Reader. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
I realize we aren't adding read functionality to the Writer, but it would be coupling the Writer to the Reader nonetheless. I understand it is brainstorming (like I said, not trying to distract from the discussion), just saying that if the Reader and the Writer both need access to the underlying data structures, then we should refactor to make that possible, not just glom the Reader onto the Writer. I suspect if that is done, anyway, that it may make the bigger picture a bit clearer, too. On Jan 9, 2009, at 2:53 PM, Michael McCandless wrote: Grant Ingersoll wrote: We've spent a lot of time up until now getting write functionality out of the Reader, and now we are going to add read functionality into the Writer? Well... we're not really adding read functionality into IW; instead, we are asking IW to open the reader for us, except the reader is provided the SegmentInfos it should use from IW (instead of trying to find the latest segments_N file in the Directory). Ie, what IW.getReader returns is an otherwise normal MultiSegmentReader. The goal is to allow an IndexReader to access segments flushed but not yet committed by IW. These segments are normally private to IW, in memory in its SegmentInfos instance. And this is all just thinking-out-loud-brainstorming. There are still many details to work through... Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure... Can't IW use the IR to do its deletions? Currently deletions in IW are implemented in DocumentsWriter.applyDeletes by loading a segment with SegmentReader.get() and making the deletions, which causes term index load overhead per flush. If IW has an internal IR then the deletion process can use it (not SegmentReader.get) and there should not be a conflict anymore between the IR and IW deletion processes. we may have to block deletions via IW Hopefully they can be buffered. Where else does the write lock need to be coordinated between IR and IW? somehow IW and IR have to split the write lock, else we may need to merge deletions somehow. This is a part I'd like to settle on before the start of implementation. It looks like in IW deletes are buffered as terms or queries until flushed. I don't think there needs to be a lock until the flush is performed? For the merge changes to the index, the deletion policy can be used to ensure a reader still has access to the segments it needs from the main directory. We have to test performance to measure the net add-to-search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach. It is true that this is the best way to start and in fact may be good enough for many users. It could help new users to expose a reader from IW so the delineation between them is removed and Lucene becomes easier to use. At the very least this system allows concurrently updateable IR and IW due to sharing the write lock, something that is currently not possible in Lucene.
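The pooling Jason suggests above can be sketched with plain Java (class and method names here are illustrative, not Lucene's real API): instead of SegmentReader.get() re-opening a segment, and re-loading its term index, on every applyDeletes, the writer hands out one cached reader per segment.

```java
import java.util.*;

// Hypothetical sketch of per-segment reader pooling. The timesOpened counter
// stands in for the term-index load cost paid each time a segment is opened.
class ReaderPoolSketch {
    static class SegmentReaderStub {
        final String segment;
        int timesOpened; // real cost: loading the term index, norms, etc.
        SegmentReaderStub(String segment) { this.segment = segment; timesOpened = 1; }
    }

    private final Map<String, SegmentReaderStub> pool = new HashMap<>();

    /** Return the pooled reader for a segment, opening it at most once. */
    SegmentReaderStub get(String segment) {
        return pool.computeIfAbsent(segment, SegmentReaderStub::new);
    }
}
```

With the pool, repeated applyDeletes passes over the same segment reuse one reader instead of paying the open cost per flush, which is the overhead the message above is pointing at.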
Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides? What we have described above (exposing IR via IW) will be sufficient and realtime will live above it. On Fri, Jan 9, 2009 at 11:15 AM, Michael McCandless luc...@mikemccandless.com wrote: Jason Rutherglen jason.rutherg...@gmail.com wrote: Are you referring to the IW.pendingCommit SegmentInfos variable? No, I'm referring to segmentInfos. (pendingCommit is the snapshot of segmentInfos taken when committing...). When you say flushed you are referring to the IW.prepareCommit method? No, I'm referring to flush... it writes a new segment but not a new segments_N, does not sync the files, and does not invoke the deletion policy. I think step #1 is important and should be generally useful outside of realtime search, however it's unclear how/when calls to IW.deleteDocument will reflect in IW.getReader? You'd have to flush (to materialize pending deletions inside IW) then reopen the reader, to see any deletions done via the writer. But I think instead realtime search would do deletions via the reader (because if you use IW you're updating deletes through the Directory = too slow). Interleaving deletes with documents added isn't possible because if the documents are in the IW ram buffer, they are not necessarily deleted Well, we buffer the delete and then on flush we materialize the delete. So if you add a doc with field X=77, then delete-by-term X:77, then flush, you'll flush a 1 document segment whose only document is marked as deleted. But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure... 
somehow IW and IR have to split the write lock, else we may need to merge deletions somehow. If this is swapped in later how is the system realtime except perhaps deletes? We have to test performance to measure the net add-to-search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach. Adding support for multiple transactions at once on IndexWriter outside of the realtime transactions seems to require a lot of refactoring. Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
I think the IW integrated IR needs a rule regarding the behavior of IW.flush and IR.flush. There will need to be a flush lock that is shared between the IW and IR. The lock is acquired at the beginning of a flush and released immediately after a successful or unsuccessful call. We will need to share this lock down to the SegmentReader level, as presumably IR.getSequentialSubReaders may be called and the subreaders operated on individually. A few questions need to be answered as to desired behavior. What happens when IW flushes w/deletes and IR has pending deletes not flushed yet? Can we automatically flush the IR deletes? If not automatically flushed, are the IR deletes still valid, and can the IR later flush them and not create a conflict (I think this is doable)? Or does the reader become readonly and IR.reopen must be called to obtain the new deletes? In the reverse scenario where IW has pending deletes and IR flushes deletes, are there issues that arise when IW later flushes? I think if it's made clear to the user what the implications are of using IR and IW in combination for deletes, then there should not be an issue with supporting deletes from IR and IW. (I found another way to format with hard line breaks http://emailformattool.com/)
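The shared flush lock described above can be sketched with a ReentrantLock (names are hypothetical): writer-side and reader-side flushes both acquire the same lock, and it is released whether the flush succeeds or throws.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a flush lock shared between writer and reader. The lock is held
// for the duration of either flush and released in a finally block, so a
// failed flush cannot leave it held.
class FlushLockSketch {
    private final ReentrantLock flushLock = new ReentrantLock();

    void writerFlush(Runnable doFlush) { runLocked(doFlush); }
    void readerFlush(Runnable doFlush) { runLocked(doFlush); }

    private void runLocked(Runnable doFlush) {
        flushLock.lock();
        try {
            doFlush.run();          // the flush may succeed or throw
        } finally {
            flushLock.unlock();     // released on success or failure
        }
    }

    boolean isLocked() { return flushLock.isLocked(); }
}
```

Sharing the same lock object down to the SegmentReader level would extend this pattern: each per-segment flush would take the one lock rather than its own.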
Re: Realtime Search
Based on our discussions, it seems best to get realtime search going in small steps. Below are some possible steps to take. Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock Patch #2: Implement a realtime ram index class Patch #3: Implement realtime transactions in IndexWriter or in a subclass of IndexWriter by implementing a createTransaction method that generates a realtime Transaction object. When the transaction is flushed, the transaction index modifications are available via the getReader method of IndexWriter The remaining question is how to synchronize the flushes to disk with IndexWriter's other index update locking mechanisms. The flushing could simply use IW.addIndexes, which has a locking mechanism in place. After flushing to disk, queued deletes would be applied to the newly copied disk segments. I think this entails opening the newly copied disk segments and applying deletes that occurred to the corresponding ram segments by cloning the new disk segments and replacing the deleted-docs bit vector, then flushing the deleted docs to disk. This system would allow us to avoid using UID in documents. The API needs to clearly separate realtime transactions vs. the existing index update methods such as addDocument, deleteDocuments, and updateDocument. I don't think it's possible to transparently implement both because the underlying implementations behave differently. It is expected that multiple transactions may be created at once, however the Transaction.flush method would block.
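The createTransaction shape outlined above might look like the following stdlib-only sketch (all names hypothetical, not a real Lucene API): several transactions can be open at once, their modifications are invisible until flushed, and Transaction.flush serializes on a shared lock so it blocks while another flush is in progress.

```java
import java.util.*;

// Hypothetical sketch of the proposed createTransaction API. Pending docs are
// private to each transaction until flush, and flushes serialize on one lock.
class TransactionSketch {
    private final Object flushLock = new Object();
    private final List<String> committedDocs =
        Collections.synchronizedList(new ArrayList<>());

    class Transaction {
        private final List<String> pending = new ArrayList<>();

        void addDocument(String doc) { pending.add(doc); }

        /** Blocks while any other transaction is flushing. */
        void flush() {
            synchronized (flushLock) {
                committedDocs.addAll(pending); // modifications become visible here
                pending.clear();
            }
        }
    }

    Transaction createTransaction() { return new Transaction(); }
    int numCommitted() { return committedDocs.size(); }
}
```

Two transactions can buffer documents concurrently; neither's documents are visible (e.g. via a getReader analogue) until its flush completes.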
Re: Realtime Search
We have worked on this problem on the server level as well. We have also open sourced it at: http://code.google.com/p/zoie/ wiki on the realtime aspect: http://code.google.com/p/zoie/wiki/ZoieSystem -John On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels reng...@ix.netcom.com wrote: If you move to either the embedded or server model, the post reopen is trivial, as the structures can be created as the segment is written. It is the networked shared access model that causes a lot of these optimizations to be far more complex than needed. Would it maybe be simpler to move to the embedded or server model, and add a network shared file (e.g. nfs) access model as a layer? The latter is going to perform far worse anyway. I guess I don't understand why Lucene continues to try and support this model. NO ONE does it any more. This is the way MS Access worked, and everyone that wanted performance needed to move to SQL server for the server model. -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 26, 2008 12:53 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? The goal is to improve worst-case write performance. Currently, writes are quick most of the time, but occasionally you'll trigger a big merge and get stuck. To solve this problem, we can assign a merge policy to our primary writer which tells it to merge no more than mergeThreshold documents. The value of mergeThreshold will need tuning depending on document size, change rate, and so on, but the idea is that we want this writer to do as much merging as it can while still keeping worst-case write performance down to an acceptable number. Doing only small merges just puts off the day of reckoning, of course.
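The capped merge policy Marvin describes can be modeled with a short sketch (names and selection heuristic are illustrative): the primary writer only merges a run of small segments whose combined doc count stays within mergeThreshold, leaving bigger consolidations to the background writer.

```java
import java.util.*;

// Sketch of a mergeThreshold-capped policy: pick the smallest segments first
// and stop before the combined doc count would exceed the cap.
class CappedMergePolicySketch {
    /** Returns the doc counts of the segments chosen for merging (possibly none). */
    static List<Integer> pickMerge(List<Integer> segmentDocCounts, int mergeThreshold) {
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        Collections.sort(sorted); // smallest segments are cheapest to merge
        List<Integer> merge = new ArrayList<>();
        int total = 0;
        for (int docs : sorted) {
            if (total + docs > mergeThreshold) break;
            merge.add(docs);
            total += docs;
        }
        // A merge needs at least two participants to be worthwhile.
        return merge.size() >= 2 ? merge : Collections.emptyList();
    }
}
```

With segments of 100, 5, 20, and 1000 docs and a threshold of 200, only the 5-, 20-, and 100-doc segments merge; the 1000-doc segment is left for the background writer, keeping the primary writer's worst-case write latency bounded.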
By avoiding big consolidations, we are slowly accumulating small-to-medium sized segments and causing a gradual degradation of search-time performance. What we'd like is a separate write process, operating (mostly) in the background, dedicated solely to merging segments which contain at least mergeThreshold docs. If all we have to do is add documents to the index, adding that second write process isn't a big deal. We have to worry about competition for segment, snapshot, and temp file names, but that's about it. Deletions make matters more complicated, but with a tombstone-based deletions mechanism, the problems are solvable. When the background merge writer starts up, it will see a particular view of the index in time, including deletions. It will perform nearly all of its operations based on this view of the index, mapping around documents which were marked as deleted at init time. In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index. * New docs have been added in new segments. * Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up. * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged. * Tombstones have been added which suppress documents in segments which have just been merged. Only the last category of deletions matters. At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock.
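The docID translation in that last step can be sketched concretely (names are illustrative, not Lucy/KS code): the merger "maps around" docs deleted at init time, producing an old-to-new docID map, and then translates any tombstones the primary writer added during the merge into the merged segment's numbering.

```java
import java.util.*;

// Sketch of tombstone translation after a background merge. Docs deleted at
// init time are dropped from the merged segment, so surviving docs get
// compacted new docIDs; later tombstones must be remapped through that table.
class TombstoneTranslateSketch {
    /**
     * Build the old->new docID map for a merge over oldMaxDoc docs, where
     * deletedAtInit marks docs the merge mapped around (-1 = not in merge).
     */
    static int[] buildDocMap(int oldMaxDoc, BitSet deletedAtInit) {
        int[] map = new int[oldMaxDoc];
        int newDocID = 0;
        for (int i = 0; i < oldMaxDoc; i++) {
            map[i] = deletedAtInit.get(i) ? -1 : newDocID++;
        }
        return map;
    }

    /** Translate tombstones added during the merge into new-segment docIDs. */
    static BitSet translate(int[] docMap, BitSet newTombstones) {
        BitSet result = new BitSet();
        for (int i = newTombstones.nextSetBit(0); i >= 0; i = newTombstones.nextSetBit(i + 1)) {
            if (docMap[i] >= 0) result.set(docMap[i]); // already-dropped docs need nothing
        }
        return result;
    }
}
```

A tombstone against a doc that was already deleted at init time simply disappears, matching Marvin's point that only deletions against just-merged, still-live documents matter.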
Worst case update performance for the system is now the sum of the time it takes the background merge writer to consolidate tombstones and the worst-case performance of the primary writer. It sounds like this is your solution for decoupling segments changes due to merges from changes from docs being indexed, from a reader's standpoint? It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though. With mmap and data structures optimized for it, we basically solve the read-time responsiveness cost problem. From the client perspective, the delay between firing off a change order and seeing that change made live is now dominated by the time it takes to actually update the index. The time between the commit and having an IndexReader which can see that commit is negligible in comparison. Since you are using mmap to achieve near zero brand-new IndexReader creation, whereas in Lucene we are moving towards
Re: Realtime Search
Andrzej Bialecki wrote: No matter whether you are right or wrong, please keep a civil tone on this public forum. +1 Ad-hominem remarks are anti-community. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Robert Engels wrote: Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding). There are no Lucene managers. We are a collaborative community. As with any community, all are not equally informed in all matters, and some may not realize they are uninformed. Consensus building is an art. One cannot simply assert that one is correct. One must rather convince others. Offending them is not a good start. Polite persistence, illustrative examples and patches are often successful. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Then your comments are misdirected. On Jan 5, 2009, at 1:19 PM, Doug Cutting wrote: Robert Engels wrote: Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding). There are no Lucene managers. We are a collaborative community. As with any community, all are not equally informed in all matters, and some may not realize they are uninformed. Consensus building is an art. One cannot simply assert that one is correct. One must rather convince others. Offending them is not a good start. Polite persistence, illustrative examples and patches are often successful. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
+1 Agreed, the initial version should use RAMDirectory in order to keep things simple and to benchmark against other MemoryIndex-like index representations. On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote: Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from DocumentsWriter RAM buffer. +1 This sounds like a good approach to me. I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of datastructures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Marvin Humphrey mar...@rectangular.com wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? It sounds like this is your solution for decoupling segments changes due to merges from changes from docs being indexed, from a reader's standpoint? Since you are using mmap to achieve near zero brand-new IndexReader creation, whereas in Lucene we are moving towards achieving real-time by always reopening a current IndexReader (not a brand new one), it seems like you should not actually have to worry about the case of reopening a reader after a large merge has finished? We need to deal with this case (background the warming) because creating that new SegmentReader (on the newly merged segment) can take a non-trivial amount of time. Mike
Re: Realtime Search
That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a binary search on the term index. Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be workable if the files were on a striped drive, or put each file on a different drive/controller, but requiring such specially configured hardware is not a good idea. In the common case (single drive), you are going to be seeking all over the place. Saving the memory structure from the write of the segment is going to offer far superior performance - you can binary seek on the memory structure, not the mmap file. The only problem with this is that there is going to be a minimum memory requirement. Also, the mmap is only suitable for 64 bit platforms, since there is no way in Java to unmap, you are going to run out of address space as segments are rewritten. -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 24, 2008 1:31 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Also, if you are really set on the mmap strategy, why not use the single file with fixed length pages, using the header I proposed (and key compression). You don't need any fancy partial page stuff, just waste a small amount of space at the end of pages. I think this is going to be far faster than a file of fixed length offsets (I assume you would also put the entry data length in file #1 as well), and a file of data (file #2). Mainly because the final page(s) can be more efficiently searched, and since you can use compression (since you have pages), the files are going to be significantly smaller (improving the write time, and the cache efficiency). -Original Message- From: Robert Engels reng...@ix.netcom.com Sent: Dec 26, 2008 11:30 AM To: java-dev@lucene.apache.org, java-dev@lucene.apache.org Subject: Re: Realtime Search That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a binary search on the term index. Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be workable if the files were on a striped drive, or put each file on a different drive/controller, but requiring such specially configured hardware is not a good idea. In the common case (single drive), you are going to be seeking all over the place. Saving the memory structure from the write of the segment is going to offer far superior performance - you can binary seek on the memory structure, not the mmap file. The only problem with this is that there is going to be a minimum memory requirement. Also, the mmap is only suitable for 64 bit platforms, since there is no way in Java to unmap, you are going to run out of address space as segments are rewritten.
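For reference, the two-file layout Robert is critiquing can be modeled in memory (an illustrative model, not any real index format): "file #1" holds fixed-length offset entries and "file #2" the concatenated sorted term bytes, so a term lookup binary-searches the offsets, touching both files at every probe, which is the seek pattern he argues will thrash a single drive.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;

// In-memory model of the two-file term index: fixed-length offsets ("file #1")
// plus concatenated sorted term data ("file #2"), searched by binary search.
class TwoFileIndexSketch {
    private final int[] offsets;   // "file #1": one fixed-length entry per term
    private final byte[] data;     // "file #2": sorted term bytes, back to back

    TwoFileIndexSketch(List<String> sortedTerms) {
        offsets = new int[sortedTerms.size() + 1];
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < sortedTerms.size(); i++) {
            offsets[i] = buf.size();
            byte[] b = sortedTerms.get(i).getBytes(StandardCharsets.UTF_8);
            buf.write(b, 0, b.length);
        }
        offsets[sortedTerms.size()] = buf.size();
        data = buf.toByteArray();
    }

    private String termAt(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }

    /** Binary search: each probe reads an offset entry, then the term bytes. */
    int find(String term) {
        int lo = 0, hi = offsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = termAt(mid).compareTo(term);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1; // term not present
    }
}
```

On disk, each of those probes is a seek into each file; in RAM or via mmap the same access pattern is cheap, which is the crux of the disagreement in this exchange.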
-Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 24, 2008 1:31 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from DocumentsWriter RAM buffer. +1 This sounds like a good approach to me. I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of datastructures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
The addition of docs into tiny segments using the current data structures seems the right way to go. Sometime back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM based) short-term index that auto-merged into a disk-based long term index that eventually got merged into archive indexes. Index optimization would take place during these merges. The search we required was very time-sensitive (searching last-minute breaking news wires). The advantage of having an archive index is that very old documents in our applications were not usually searched on unless archives were explicitly selected. -- Joaquin On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote: Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from DocumentsWriter RAM buffer. +1 This sounds like a good approach to me. I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of datastructures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? The goal is to improve worst-case write performance. Currently, writes are quick most of the time, but occasionally you'll trigger a big merge and get stuck. To solve this problem, we can assign a merge policy to our primary writer which tells it to merge no more than mergeThreshold documents. The value of mergeThreshold will need tuning depending on document size, change rate, and so on, but the idea is that we want this writer to do as much merging as it can while still keeping worst-case write performance down to an acceptable number. Doing only small merges just puts off the day of reckoning, of course. By avoiding big consolidations, we are slowly accumulating small-to-medium sized segments and causing a gradual degradation of search-time performance. What we'd like is a separate write process, operating (mostly) in the background, dedicated solely to merging segments which contain at least mergeThreshold docs. If all we have to do is add documents to the index, adding that second write process isn't a big deal. We have to worry about competition for segment, snapshot, and temp file names, but that's about it. Deletions make matters more complicated, but with a tombstone-based deletions mechanism, the problems are solvable. When the background merge writer starts up, it will see a particular view of the index in time, including deletions. It will perform nearly all of its operations based on this view of the index, mapping around documents which were marked as deleted at init time. In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index. * New docs have been added in new segments.
* Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up. * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged. * Tombstones have been added which suppress documents in segments which have just been merged. Only the last category of deletions matters. At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock. Worst case update performance for the system is now the sum of the time it takes the background merge writer to consolidate tombstones and the worst-case performance of the primary writer. It sounds like this is your solution for decoupling segments changes due to merges from changes from docs being indexed, from a reader's standpoint? It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though. With mmap and data structures optimized for it, we basically solve the read-time responsiveness cost problem. From the client perspective, the delay between firing off a change order and seeing that change made live is now dominated by the time it takes to actually update the index. The time between the commit and having an IndexReader which can see that commit is negligible in comparison. Since you are using mmap to achieve near zero brand-new IndexReader creation, whereas in Lucene we are moving towards achieving real-time by always reopening a current IndexReader (not a brand new one), it seems like you should not actually have to worry about the case of reopening a reader after a large merge has finished?
Even though we can rely on mmap rather than slurping, there are potentially a lot of files to open and a lot of JSON-encoded metadata to parse, so I'm not certain that Lucy/KS will never have to worry about the time it takes to open a new IndexReader. Fortunately, we can implement reopen() if we need to. We need to deal with this case (background the warming) because creating that new SegmentReader (on the newly merged segment) can take a non-trivial amount of time. Yes. Without mmap or some other solution, I think improvements to worst-case update performance in Lucene will continue to be constrained by post-commit IndexReader opening costs. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
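The tombstone-translation step Marvin describes (the background merge writer remapping recently added tombstones into the merged segment's doc numbering before writing a tombstone file against itself) might look roughly like this sketch. All names here are invented for illustration, not Lucy/KS or Lucene code; in particular, we assume the merge produced a per-source-segment oldDocId -> newDocId map, with -1 for docs the merger skipped because they were already deleted when it started:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: translate tombstones recorded against pre-merge
// segments into doc numbers valid for the newly merged segment.
public class TombstoneTranslator {
    public static List<Integer> translate(int[][] tombstones, int[][] docMaps) {
        List<Integer> translated = new ArrayList<>();
        for (int seg = 0; seg < tombstones.length; seg++) {
            for (int oldDoc : tombstones[seg]) {
                int newDoc = docMaps[seg][oldDoc];
                if (newDoc != -1) {      // doc survived the merge; suppress it now
                    translated.add(newDoc);
                }
            }
        }
        return translated;               // would be written as a tombstone file
    }

    public static void main(String[] args) {
        // Segment 0 had 4 docs; doc 1 was already deleted when the merge started.
        int[][] docMaps = { {0, -1, 1, 2}, {3, 4} };
        // Deletes that arrived while the merge ran (the "last category" above):
        int[][] tombstones = { {2}, {0} };
        System.out.println(translate(tombstones, docMaps)); // [1, 3]
    }
}
```

Only this remapping needs the exclusive write lock; everything else the merge writer does happens against its init-time snapshot.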
Re: Realtime Search
One thing that I forgot to mention is that in our implementation the real-time indexing took place with many folder-based listeners writing to many tiny in-memory indexes partitioned by sub-sources, with fewer long-term and archive indexes per box. Overall distributed search across various Lucene-based search services was done using a federator component, very much like shard-based search is done today (I believe). -- Joaquin. On Fri, Dec 26, 2008 at 10:48 AM, J. Delgado joaquin.delg...@gmail.com wrote: The addition of docs into tiny segments using the current data structures seems the right way to go. Sometime back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM based) short-term index that auto-merged into a disk-based long-term index that eventually got merged into archive indexes. Index optimization would take place during these merges. The search we required was very time-sensitive (searching last-minute breaking news wires). The advantage of having an archive index is that very old documents in our applications were not usually searched on unless archives were explicitly selected. -- Joaquin On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote: Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from the DocumentsWriter RAM buffer. +1 This sounds like a good approach to me.
I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of data structures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug
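Joaquin's short-term/long-term tiering can be illustrated with a toy, non-Lucene sketch (all classes here are hypothetical; a real implementation would use something like a RAMDirectory-backed index, an FSDirectory-backed index, and a MultiSearcher over both):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy sketch of RAM/disk tiering: new docs land in a small RAM index and are
// searchable immediately; once the RAM tier exceeds a threshold it is merged
// into the long-term tier. Searches consult both tiers.
public class TieredIndex {
    private final Map<String, Set<Integer>> ramIndex = new HashMap<>();
    private final Map<String, Set<Integer>> diskIndex = new HashMap<>(); // stand-in for the disk tier
    private int ramDocCount = 0;
    private final int mergeThreshold;

    public TieredIndex(int mergeThreshold) { this.mergeThreshold = mergeThreshold; }

    public void addDocument(int docId, String... terms) {
        for (String t : terms) ramIndex.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
        if (++ramDocCount >= mergeThreshold) mergeRamToDisk();
    }

    // "Auto-merge" of the RAM tier into the long-term tier (a real system would
    // do this in a background thread as a segment merge).
    private void mergeRamToDisk() {
        for (Map.Entry<String, Set<Integer>> e : ramIndex.entrySet())
            diskIndex.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        ramIndex.clear();
        ramDocCount = 0;
    }

    // Search consults both tiers, like a MultiSearcher over the two indexes.
    public Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>(diskIndex.getOrDefault(term, Set.of()));
        hits.addAll(ramIndex.getOrDefault(term, Set.of()));
        return hits;
    }

    public static void main(String[] args) {
        TieredIndex idx = new TieredIndex(2);
        idx.addDocument(0, "breaking", "news");
        idx.addDocument(1, "news");       // triggers merge into the long-term tier
        idx.addDocument(2, "breaking");   // still in RAM, yet immediately searchable
        System.out.println(idx.search("breaking")); // [0, 2]
    }
}
```

The point of the sketch is the control structure: documents become searchable before any disk merge happens, and the merge only changes where postings live, not what a search sees.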
Re: Realtime Search
This is what we mostly do, but we serialize the documents to a log file first, so if the server crashes before the background merge of the RAM segments into the disk segments completes, we can replay the operations on server restart. Since the serialize is a sequential write to an already open file, it is very fast. I realize that many users do not wrap Lucene in a server process, so it doesn't seem that writing only to the RAM segments will work? How will the other processes/servers see them? Doesn't seem it would be real-time for them. Maybe restrict the real-time search to server Lucene installations? If you are concerned about performance in the first place, that seems a requirement anyway. On this note, maybe to allow greater advancement of Lucene, Lucene should move to a design approach similar to many databases. You have an embedded version, which is designed for a single process with multiple threads, and a server version which wraps the embedded version allowing multiple clients. Seems to be a far simpler architecture. I know I have brought this up in the past, but maybe it is time to revisit? It was the core of Unix design (no file locks needed), and works well for many dbs (e.g. Derby) -Original Message- From: Doug Cutting cutt...@apache.org Sent: Dec 26, 2008 12:20 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search
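The log-then-apply scheme Robert describes (serialize each operation to a log before applying it to the RAM segments, and replay the log on restart after a crash) could be sketched like this. The format and class names are invented for illustration, and a real implementation would keep the log file open across appends rather than reopening it each time:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Minimal operation-log sketch: appends are sequential writes, so they are cheap;
// after a crash, replay() returns the operations that must be re-applied.
public class OpLog {
    public static void append(Path log, String op) throws IOException {
        // A real server would hold one open Writer and flush per op.
        try (Writer w = Files.newBufferedWriter(log,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            w.write(op);
            w.write('\n');
        }
    }

    public static List<String> replay(Path log) throws IOException {
        return Files.exists(log) ? Files.readAllLines(log) : List.of();
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("oplog", ".log");
        append(log, "ADD doc1");
        append(log, "DELETE doc0");
        System.out.println(replay(log)); // [ADD doc1, DELETE doc0]
        Files.delete(log);
    }
}
```

Once the RAM segments are durably merged to disk, the log (or the prefix covering those operations) can be truncated.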
Re: Realtime Search
If you move to either the embedded or server model, the post-reopen is trivial, as the structures can be created as the segment is written. It is the networked shared access model that causes a lot of these optimizations to be far more complex than needed. Would it maybe be simpler to move to the embedded or server model, and add a network shared file (e.g. nfs) access model as a layer? The latter is going to perform far worse anyway. I guess I don't understand why Lucene continues to try and support this model. NO ONE does it any more. This is the way MS Access worked, and everyone that wanted performance needed to move to SQL Server for the server model. -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 26, 2008 12:53 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search
Re: Realtime Search
There is also the distributed model - but in that case each node is running some sort of server anyway (as in Hadoop). It seems that the distributed model would be easier to develop using Hadoop over the embedded model. -Original Message- From: Robert Engels reng...@ix.netcom.com Sent: Dec 26, 2008 2:34 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search
Re: Realtime Search
Robert, Three exchanges ago in this thread, you made the incorrect assumption that the motivation behind using mmap was read speed, and that memory mapping was being waved around as some sort of magic wand: Is there something that I am missing? I see lots of references to using memory mapped files to dramatically improve performance. I don't think this is the case at all. At the lowest levels, it is somewhat more efficient from a CPU standpoint, but with a decent OS cache the IO performance difference is going to negligible. In response, I indicated that the mmap design had been discussed in JIRA, and pointed you at a particular issue. There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT to opening/reopening an IndexReader. Apparently, you did not go back to read that JIRA thread, because you subsequently offered a critique of a purely invented design you assumed we must have arrived at, and continued to argue with a straw man about read speed: 1. with fixed size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. While my reply did not specifically point back to LUCENE-1458 again, I hoped that having your foolish assumption exposed would motivate you to go back and read it, so that you could offer an informed critique of the *actual* design. I also linked to a specific comment in LUCENE-831 which explained how mmap applied to sort caches. Additionally, sort caches would be written at index time in three files, and memory mapped as laid out in https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150. Apparently you still didn't go back and read up, because you subsequently made a third incorrect assumption, this time about plans to do away with the term dictionary index. 
In response I griped about JIRA again, using slightly stronger but still intentionally indirect language. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Unfortunately, this must not have worked either, because you have now offered a fourth message based on incorrect assumptions which would have been remedied by bringing yourself up to date with the relevant JIRA threads. That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a binary search on the term index. No. As discussed in LUCENE-1458, LUCENE-1483, the specific link I pointed you towards in LUCENE-831, the message where I provided you with that link, and elsewhere in this thread... loading the term dictionary index is important, but the cost pales in comparison to the cost of loading sort caches. Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be workable if the files were on a striped drive, or put each file on a different drive/controller, but requiring such specially configured hardware is not a good idea. In the common case (single drive), you are going to be seeking all over the place. Mike McCandless and I had an extensive debate about the pros and cons of depending on the OS cache to hold the term dictionary index under LUCENE-1458. The concerns you express here were fully addressed, and even resolved under an agree to disagree design. Also, the mmap is only suitable for 64 bit platforms, since there is no way in Java to unmap, you are going to run out of address space as segments are rewritten. 
The discussion of how the mmap design translates from Lucy to Lucene is an important one, but I despair of having it if we have to rehash all of LUCENE-1458, LUCENE-831, and possibly LUCENE-1476 and LUCENE-1483 because you cannot be troubled to bring yourself up to speed before commenting. You are obviously knowledgeable on the subject of low-level memory issues. Me and Mike McCandless ain't exactly chopped liver, though, and neither are a lot of other people around here who *are* bothering to keep up with the threads in JIRA. I request that you show the rest of us more respect. Our time is valuable, too. Marvin Humphrey
Re: Realtime Search
You are full of crap. From your own comments in Lucene 1458: The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us rely upon the operating system's virtual memory and avoid caching in process memory altogether? Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size. For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable. and then you state farther down Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case. We could also explore something in-between, eg it'd be nice to genericize MultiLevelSkipListWriter so that it could index arbitrary files, then we could use that to index the terms dict. You could choose to spend dedicated process RAM on the higher levels of the skip tree, and then tentatively trust IO cache for the lower levels. That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. 
So I imagine we'll choose different paths. The thing I find funny is that many are approaching these issues as if new ground is being broken. These are ALL standard, long-known issues that any database engineer has already worked with, and there are accepted designs given the applicable constraints. This is why I've tried to point folks towards alternative designs that open the door much wider to increased performance/reliability/robustness. Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding). -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 26, 2008 3:53 PM To: java-dev@lucene.apache.org, Robert Engels reng...@ix.netcom.com Subject: Re: Realtime Search
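The fixed-width-block term dictionary quoted from LUCENE-1458 can be sketched as follows. The block size, status byte, and length-prefixed encoding are illustrative details for this sketch, not the actual Lucy/KS file format; the point is that fixed-width blocks let a reader seek the (mmap-able) index to multiples of the block size and binary-search, rather than binary-searching an array of terms cached in process memory:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of a fixed-width-block term index: each block starts with a status byte
// (0 = block begins with a complete term; 1 would mark a continuation of a long
// term from the previous block) followed by a length-prefixed term.
public class BlockTermIndex {
    static final int BLOCK_SIZE = 32; // 1024 in the proposal; small here for demo

    static ByteBuffer write(List<String> terms) {
        ByteBuffer buf = ByteBuffer.allocate(terms.size() * BLOCK_SIZE);
        for (String t : terms) {              // one term per block, for simplicity
            byte[] b = t.getBytes(StandardCharsets.UTF_8);
            int start = buf.position();
            buf.put((byte) 0);                // status: complete term follows
            buf.put((byte) b.length);         // length-prefixed term
            buf.put(b);
            buf.position(start + BLOCK_SIZE); // pad out the fixed-width block
        }
        buf.flip();
        return buf;
    }

    static String firstTermOf(ByteBuffer buf, int block) {
        int off = block * BLOCK_SIZE;         // status byte at off is 0 here
        int len = buf.get(off + 1);
        byte[] b = new byte[len];
        for (int i = 0; i < len; i++) b[i] = buf.get(off + 2 + i);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Binary search over block offsets: returns the block whose first term is the
    // greatest term <= target, i.e. where a full term-dict seek would continue.
    static int findBlock(ByteBuffer buf, String target) {
        int lo = 0, hi = buf.limit() / BLOCK_SIZE - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTermOf(buf, mid).compareTo(target) <= 0) { ans = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return ans;
    }

    public static void main(String[] args) {
        ByteBuffer idx = write(List.of("apple", "lucene", "search", "zebra"));
        System.out.println(findBlock(idx, "merge")); // 1 (block starting "lucene")
    }
}
```

With an mmap'd file, the ByteBuffer here would come from the OS page cache, so "opening" the index costs nothing and forks share the backing buffers.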
Re: Realtime Search
Robert Engels wrote: You are full of **beep** *beep* ... No matter whether you are right or wrong, please keep a civil tone on this public forum. We are professionals here, so let's discuss and disagree if must be - but in a professional and grown-up way. Thank you. -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Re: Realtime Search
I think the necessary low-level changes to Lucene for real-time are actually already well underway... The biggest barrier is how we now ask for FieldCache values at the Multi*Reader level. This makes reopen cost catastrophic for a large index. Once we succeed in making FieldCache usage within Lucene segment-centric (LUCENE-1483 = sorting becomes segment-centric; LUCENE-831 = deprecate the old FieldCache API in favor of a segment-centric or iteration API), we are most of the way there. LUCENE-1231 (column-stride fields) should make initing the per-segment FieldCache much faster, though I think that's a nice-to-have for real-time search (because either 1) warming will happen in the BG, or 2) the segment is tiny). So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from the DocumentsWriter RAM buffer. One challenge is reopening after a big merge finishes... we'd need a way to 1) allow the merge to be committed, then 2) start warming a new reader in the BG, but 3) allow newly flushed segments to use the old SegmentReaders reading the segments that were merged (because they are still warm), and 4) once the new reader is warm, we decref old segments and use the new reader going forwards. Alternatively, and maybe simpler, a merge is not allowed to commit until a new SegmentReader has been warmed against the newly merged segment. I'm not sure how best to do this... we may need more info in SegmentInfo[s] to track the genealogy of each segment, or something.
We may need to have IndexWriter give more info when it's modifying SegmentInfos, eg we'd need the reader to access newly flushed segments (IndexWriter does not write a new segments_N until commit). Maybe IndexWriter needs to warm readers... maybe IndexReader.open/reopen needs to be given an IndexWriter and then access its un-flushed in-memory SegmentInfos... not sure. We'd need to fix SegmentReader.get to provide a single instance for a given segment. I agree we'd want a specialized merge policy. EG it should merge RAM segments w/ higher priority, and probably not merge mixed RAM/disk segments. Mike Jason Rutherglen jason.rutherg...@gmail.com wrote: We've discussed realtime search before; it looks like after the next release we can get some sort of realtime search working. I was going to open a new issue but decided it might be best to discuss realtime search on the dev list. Lucene can implement realtime search as the ability to add, update, or delete documents with latency in the sub-5-millisecond range. A couple of different options are available. 1) Expose a rolling set of realtime readers over the memory index used by IndexWriter. Requires incrementally updating field caches and filters, and it is somewhat unclear how IndexReader versioning would work (for example versions of the term dictionary). 2) Implement realtime search by incrementally creating and merging readers in memory. The system would use MemoryIndex or InstantiatedIndex to quickly (more quickly than RAMDirectory) create indexes from added documents. The in-memory indexes would be periodically merged in the background and, according to RAM used, written to disk. Each update would generate a new IndexReader or MultiSearcher that includes the new updates. Field caches and filters could be cached per IndexReader according to how Lucene works today.
The downside of this approach is that the indexing will not be as fast as #1 because of the in-memory merging, which is similar to pre-2.3 Lucene, which merged in-memory segments using RAMDirectory. Are there other implementation options? A new patch would focus on providing in-memory indexing as part of the core of Lucene. The work of LUCENE-1483 and LUCENE-1314 would be used. I am not sure if option #2 can become part of core if it relies on a contrib module? It makes sense to provide a new realtime-oriented merge policy that merges segments based on the number of deletes rather than a merge factor. The realtime merge policy would keep the segments within a minimum and maximum size in kilobytes to limit the time consumed by merging, which it is assumed would occur frequently. LUCENE-1313 includes a transaction log with rollback and was designed with distributed search in mind; it may be retired or its components split out.
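Mike's "a merge is not allowed to commit until a new SegmentReader has been warmed" alternative is essentially a warm-then-publish control structure. Here is a minimal non-Lucene sketch of that idea; every type and method name is invented for illustration:

```java
import java.util.concurrent.atomic.AtomicReference;

// Control-structure sketch: after a big merge, a reader over the merged segment
// is warmed before it is published, so searches never block on a cold reader;
// until the swap, they keep using the old, already-warm reader.
public class WarmThenSwap {
    interface Reader { String describe(); }

    private final AtomicReference<Reader> current;

    WarmThenSwap(Reader initial) { current = new AtomicReference<>(initial); }

    Reader acquire() { return current.get(); }   // searches always get a warm reader

    void onMergeFinished(Reader cold) {
        warm(cold);                              // e.g. preload norms, FieldCache...
        Reader old = current.getAndSet(cold);    // publish only after warming
        // a refcounted implementation would decref 'old' here and close it
        // once in-flight searches release it
    }

    private void warm(Reader r) { r.describe(); /* touch the data structures */ }

    public static void main(String[] args) {
        WarmThenSwap mgr = new WarmThenSwap(() -> "segments _0 _1 _2");
        mgr.onMergeFinished(() -> "merged segment _3");
        System.out.println(mgr.acquire().describe()); // merged segment _3
    }
}
```

The alternative Mike lists first (commit the merge, warm in the background, let new flushes keep referencing the old SegmentReaders) moves the warm() call off the merge path but needs the segment-genealogy bookkeeping he mentions.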
Re: Realtime Search
Thinking about this some more, you could use fixed-length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - since a simple binary search on the page is not possible). But, if you also add a 'least term and greatest term' to the page header (you can avoid the duplicate storage of these entries as well), you can perform a binary search of the term index much faster. You only need to decode the index page containing (maybe) the desired entry. If you were doing a prefix/range search, you will still end up decoding lots of pages... This is why a database has its own page cache, and usually caches the decoded form (for index pages) for faster processing - at the expense of higher memory usage. Usually data pages are not cached in the decoded/uncompressed form. In most cases the database vendor will recommend removing the OS page cache on the database server, and allocating all of the memory to the database process. You may be able to avoid some of the warm-up of an index using memory-mapped files, but with proper ordering of the writing of the index, it probably isn't necessary. Beyond that, processing the term index directly using NIO does not appear that it will be faster than using an in-process cache of the term index (similar to the skip-to memory index now). The BEST approach is probably to have the index writer build the in-memory skip-to structure as it writes the segment, and then include this in the segment during the reopen - no warming required! As long as the reader and writer are in the same process, it will be a winner! On Dec 23, 2008, at 11:02 PM, robert engels wrote: Seems doubtful you will be able to do this without increasing the index size dramatically.
Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it will make much difference in speed - here are several reasons why... 1. With fixed-size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. 2. With a reopen, only new segments are read, and since it is a new segment, it is most likely already in the disk cache (from the write), so the reopen penalty is negligible (especially if the term index skip-to is written last). 3. If the reopen is after an optimize - when the OS cache has probably been obliterated - then the warm-up time is going to be similar in most cases anyway, since the index pages will also not be in core (in the case of memory-mapped files). Again, writing the skip-to last can help with this. Just because a file is memory mapped does not mean its pages will have a greater likelihood to be in the cache. The locality of reference is going to control this, just as most/often access controls it in the OS disk cache. Also, most OSs will take real memory from the virtual address space and add it to the disk cache if the process is doing lots of IO. If you have a memory-mapped term index, you are still going to need to perform a binary search to find the correct term page, and after an optimize the visited pages will not be in the cache (or in core). On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458.
The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
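The page-header scheme described in the message above can be sketched in a few lines: fixed-length pages whose headers carry a least/greatest term, binary-searched so that at most one page ever needs decoding for an exact lookup. This is an illustrative sketch with hypothetical names, with a String[] standing in for the key-compressed page entries.

```java
import java.util.Arrays;

/**
 * Sketch of the page-header idea: the term index is split into pages,
 * each header recording its least and greatest term. A lookup
 * binary-searches the headers, and only the single candidate page needs
 * to be decoded. Hypothetical names throughout, not a Lucene API.
 */
class PagedTermIndex {
    static class Page {
        final String leastTerm, greatestTerm;
        final String[] terms; // stands in for the key-compressed entries
        Page(String[] terms) {
            this.terms = terms;
            this.leastTerm = terms[0];
            this.greatestTerm = terms[terms.length - 1];
        }
    }

    private final Page[] pages; // pages are in term order

    PagedTermIndex(Page[] pages) { this.pages = pages; }

    /** Returns the index of the page that may contain term, or -1. */
    int findPage(String term) {
        int lo = 0, hi = pages.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Page p = pages[mid];
            if (term.compareTo(p.leastTerm) < 0) hi = mid - 1;
            else if (term.compareTo(p.greatestTerm) > 0) lo = mid + 1;
            else return mid; // least <= term <= greatest: decode only this page
        }
        return -1;
    }

    /** Decodes (here: scans) just the one candidate page. */
    boolean contains(String term) {
        int p = findPage(term);
        return p >= 0 && Arrays.asList(pages[p].terms).contains(term);
    }
}
```

As the thread notes, this helps exact lookups; a prefix/range scan would still walk and decode many consecutive pages.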
Re: Realtime Search
On Wednesday 24 December 2008 17:51:04, robert engels wrote: Thinking about this some more, you could use fixed-length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - since a simple binary search on the page is not possible). The cache between the pages and the CPU is also a bottleneck nowadays. See here: Super-Scalar RAM-CPU Cache Compression, M. Zukowski, S. Heman, N. Nes, P. Boncz - cwi.nl, currently available from this link: http://www.cwi.nl/htbin/ins1/publications?request=pdf&gzkey=ZuHeNeBo:ICDE:06 Also, some preliminary results on Lucene indexes are available at LUCENE-1410. Regards, Paul Elschot
Re: Realtime Search
Jason Rutherglen wrote: 2) Implement realtime search by incrementally creating and merging readers in memory. The system would use MemoryIndex or InstantiatedIndex to quickly (more quickly than RAMDirectory) create indexes from added documents. As a baseline, how fast is it to simply use RAMDirectory? If one, e.g., flushes changes every 10ms or so, and has a background thread that uses IndexReader.reopen() to keep a fresh version for reading? Also, what are the requirements? Must a document be visible to search within 10ms of being added? Or must it be visible to search from the time that the call to add it returns? In the latter case one might still use an approach like the above. Writing a small new segment to a RAMDirectory and then, with no merging, calling IndexReader.reopen(), should be quite fast. All merging could be done in the background, as should post-merge reopens() that involve large segments. In short, I wonder if new reader and writer implementations are in fact required or whether, perhaps with a few optimizations, the existing implementations might meet this need. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
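Doug's baseline - buffer writes, then periodically publish a fresh reader, analogous to flushing to a RAMDirectory and calling IndexReader.reopen() - can be sketched without the Lucene API at all. The names below are illustrative; a List of strings stands in for the index, and refresh() plays the role of the background reopen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of the buffered-write / background-reopen pattern: writes go to
 * a writer-side buffer, and refresh() publishes an immutable snapshot
 * that searchers read without blocking, analogous to IndexReader.reopen()
 * over a RAMDirectory. Illustrative names, not Lucene's API.
 */
class SnapshotIndex {
    private final List<String> pending = new ArrayList<String>();   // writer side
    private final AtomicReference<List<String>> published =
        new AtomicReference<List<String>>(new ArrayList<String>()); // reader side

    synchronized void addDocument(String doc) {
        pending.add(doc); // buffered: not yet visible to searches
    }

    /** The "reopen": publish a fresh immutable snapshot of everything so far. */
    synchronized void refresh() {
        List<String> snap = new ArrayList<String>(published.get());
        snap.addAll(pending);
        pending.clear();
        published.set(snap); // atomic swap; in-flight searches keep the old snapshot
    }

    boolean isVisible(String doc) {
        return published.get().contains(doc); // searches never block on writes
    }
}
```

Run refresh() from a background thread every ~10ms and added documents become searchable within that interval, which is the trade-off Doug is asking about: per-interval visibility versus visibility on return from add.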
Re: Realtime Search
As I pointed out in another email, I understand the benefits of compression (compressed disks vs. uncompressed, etc.). PFOR is definitely a winner ! As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. I was attempting to point out that this in-memory index is still needed, but there are ways to improve the current process. I don't think a mapped file for the term index is going to work for a variety of reasons. Mapped files are designed as a programming simplification - mainly for older systems that use line delimited files - rather than having to create page/section caches when processing very large files (when only a small portion is used at any given time - ie. the data visible on the screen). When you end up visiting a large portion of the file anyway (to do a full repagination), an in-process intelligent cache is going to be far superior. My review of the Java Buffer related classes does not give me the impression it is going to be faster - in fact it will be slower- than a single copy into user space, and process/decompress there. The Buffer system is suitable when perform little inspection, and then direct copy to another buffer (think reading from a file, and sending out on a socket). If you end up inspecting the buffer, it is going to be very slow. On Dec 24, 2008, at 11:33 AM, Paul Elschot wrote: Op Wednesday 24 December 2008 17:51:04 schreef robert engels: Thinking about this some more, you could use fixed length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - since a simple binary search on the page is not possible). The cache between the pages and the cpu is also a bottleneck nowadays. 
See here: Super-Scalar RAM-CPU Cache Compression M Zukowski, S Heman, N Nes, P Boncz - cwi.nl currently available from this link: http://www.cwi.nl/htbin/ins1/publications? request=pdfgzkey=ZuHeNeBo:ICDE:06 Also, some preliminary results on lucene indexes are available at LUCENE-1410. Regards, Paul Elschot But, if you also add a 'least term and greatest term' to the page header (you can avoid the duplicate storage of these entries as well), you can perform a binary search of the term index much faster. You only need to decode the index page containing (maybe) the desired entry. If you were doing a prefix/range search, you will still end up decoding lots of pages... This is why a database has their own page cache, and usually caches the decoded form (for index pages) for faster processing - at the expense of higher memory usage. Usually data pages are not cached in the decoded/uncompressed form. In most cases the database vendor will recommend removing the OS page cache on the database server, and allocating all of the memory to the database process. You may be able to avoid some of the warm-up of an index using memory mapped files, but with proper ordering of the writing of the index, it probably isn't necessary. Beyond that, processing the term index directly using NIO does not appear that it will be faster than using an in-process cache of the term index (similar to the skip-to memory index now). The BEST approach is probably to have the index writer build the memory skip to structure as it writes the segment, and then include this in the segment during the reopen - no warming required !. As long as the reader and writer are in the same process, it will be a winner ! On Dec 23, 2008, at 11:02 PM, robert engels wrote: Seems doubtful you will be able to do this without increasing the index size dramatically. 
Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it will make much difference in speed - here's several reasons why... 1. with fixed size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. 2. with a reopen, only new segments are read, and since it is a new segment, it is most likely already in the disk cache (from the write), so the reopen penalty is negligible (especially if the term index skip to is written last). 3. If the reopen is after an optimize - when the OS cache has probably been obliterated, then the warm up time is going to be similar in most cases anyway, since the index pages will also not be in core (in the case of memory mapped files). Again, writing the skip to last can help with this. Just because a file is memory mapped does not mean its pages will have an greater likelihood to be in the cache. The locality of reference is
Re: Realtime Search
Also, what are the requirements? Must a document be visible to search within 10ms of being added? 0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDirectories if they are of sufficient size. Or must it be visible to search from the time that the call to add it returns? Most people probably expect the update latency offered by SQL databases. As a baseline, how fast is it to simply use RAMDirectory? It depends on how fast searches over the realtime index need to be. The detriment to speed occurs with having many small segments that are continuously decoded (terms, postings, etc). The advantage of MemoryIndex and InstantiatedIndex is an actual increase in search speed compared with RAMDirectory (see the Performance Notes at http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/memory/MemoryIndex.html) and no need to continuously decode segments that are short lived. Anecdotal tests indicated the merging overhead of using RAMDirectory, as compared with MI or II, is significant enough to make it only useful for doing batches in the 1000s, which does not seem to be what people expect from realtime search.
Re: Realtime Search
On Dec 24, 2008, at 12:23 PM, Jason Rutherglen wrote: Also, what are the requirements? Must a document be visible to search within 10ms of being added? 0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDirectories if they are of sufficient size. Or must it be visible to search from the time that the call to add it returns? Most people probably expect the update latency offered by SQL databases. This is the problem spot. In an SQL database, when an update/add occurs, the same connection/transaction will see the changes when requested IMMEDIATELY - there is 0 latency. In order to do this you MUST have the concept of transactions and/or connections. OR you must make it so that every update/add is immediately available - this is probably simpler. You just need to always search the RAM and the disk index. The deletions must be mapped to the disk index, and the latest version of the document must be obtained from the RAM index (if it is there). You just need to merge the RAM and disk in the background... and continually create new/merged RAM disks. The memory requirements are going to go up, but you can always add a block so that if the background merger gets too far behind, the system blocks any current requests (to avoid the system running out of memory).
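The "always search RAM and disk, with deletions mapped to the disk index" scheme robert describes can be sketched as a layered view: the RAM side shadows the disk side, and a tombstone set marks disk documents as superseded or deleted. A minimal sketch with hypothetical names; maps stand in for the two indexes, and the background merger is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of a RAM-over-disk view: every lookup consults the RAM index
 * first, deletes against the immutable disk index are tracked in a
 * tombstone set, and a background merger would periodically fold the RAM
 * side into a new disk segment. Illustrative names, not a Lucene API.
 */
class RamOverDiskView {
    private final Map<String, String> disk;  // id -> stored doc (immutable segment)
    private final Map<String, String> ram = new HashMap<String, String>();
    private final Set<String> diskDeletes = new HashSet<String>();

    RamOverDiskView(Map<String, String> disk) { this.disk = disk; }

    void update(String id, String doc) {
        ram.put(id, doc);    // immediately searchable: 0 latency
        diskDeletes.add(id); // shadow any older disk version
    }

    void delete(String id) {
        ram.remove(id);
        diskDeletes.add(id);
    }

    /** The latest version comes from RAM if present, else undeleted disk. */
    String get(String id) {
        String r = ram.get(id);
        if (r != null) return r;
        return diskDeletes.contains(id) ? null : disk.get(id);
    }
}
```

The memory-pressure backstop robert mentions would sit on top of this: if the RAM side grows past a bound before the merger catches up, update() blocks.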
Re: Realtime Search
On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote: Seems doubtful you will be able to do this without increasing the index size dramatically. Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. Wow. That's a spectacularly awful design. Its worst case -- one outlier term, say, 1000 characters in length, in a field where the average term length is in the single digits -- would explode the index size and incur wasteful IO overhead, just as you say. Good thing we've never considered it. :) I'm hoping we can improve on this, but for now, we've ended up at a two-file design for the term dictionary index. 1) Stacked 64-bit file pointers. 2) Variable-length character and term info data, interpreted using a pluggable codec. In the index at least, each entry would contain the full term text, encoded as UTF-8. Probably the primary term dictionary would continue to use string diffs. That design offers no significant benefits other than those that flow from compatibility with mmap: faster IndexReader open/reopen, lower RAM usage under multiple processes by way of buffer sharing. IO bandwidth requirements and speed are probably a little better, but lookups on the term dictionary index are not a significant search-time bottleneck. Additionally, sort caches would be written at index time in three files, and memory mapped as laid out in https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150. 1) Stacked 64-bit file pointers. 2) Character data. 3) Doc num to ord mapping. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
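The two-file layout Marvin describes avoids the fixed-width-term problem: the pointer file has fixed-width 64-bit entries, so entry i is found by arithmetic alone, while the term data stays variable length. A sketch under stated assumptions - in-memory arrays stand in for the two mapped files, and the class and method names are hypothetical, not the Lucy/KS format.

```java
import java.nio.charset.Charset;

/**
 * Sketch of a two-file term dictionary index: "file 1" is a stack of
 * fixed-width 64-bit pointers, "file 2" holds variable-length UTF-8 term
 * text. Fixed-width pointers mean entry i is located by arithmetic, so
 * the structure could be memory mapped and binary searched with no
 * up-front unpacking. Arrays stand in for the mapped files here.
 */
class TwoFileTermIndex {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final long[] pointers; // "file 1": byte offset of each term in the blob
    private final byte[] blob;     // "file 2": concatenated UTF-8 term text

    TwoFileTermIndex(String[] sortedTerms) {
        pointers = new long[sortedTerms.length];
        StringBuilder sb = new StringBuilder();
        long off = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            pointers[i] = off;
            off += sortedTerms[i].getBytes(UTF8).length;
            sb.append(sortedTerms[i]);
        }
        blob = sb.toString().getBytes(UTF8);
    }

    /** Reads entry i straight out of the "mapped" data: no decode pass. */
    String term(int i) {
        int start = (int) pointers[i];
        int end = (i + 1 < pointers.length) ? (int) pointers[i + 1] : blob.length;
        return new String(blob, start, end - start, UTF8);
    }

    /** Binary search over entries without unpacking the whole index. */
    int find(String target) {
        int lo = 0, hi = pointers.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = term(mid).compareTo(target);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }
}
```

This is what makes reopen cheap in Marvin's argument: nothing is built at IndexReader startup, because the on-disk representation is already directly searchable.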
Re: Realtime Search
On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Tue, Dec 23, 2008 at 05:51:43PM -0800, Jason Rutherglen wrote: Are there other implementation options? Here's the plan for Lucy/KS: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. 2) Enable segment-centric sorted search. (LUCENE-1483) 3) Implement tombstone-based deletions, so that the cost of deleting documents scales with the number of deletions rather than the size of the index. 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
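Plan item 3 above - tombstone-based deletions - can be shown in a few lines. The contrast is with a per-segment deletions bit vector, whose write cost scales with segment size; with tombstones, each delete is a constant-cost append, and hits are filtered against the tombstone set at search time. Hypothetical names; this is a sketch of the idea, not the Lucy/KS implementation.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of tombstone-based deletions: instead of rewriting a bit vector
 * sized to the whole segment on every commit, each delete records a
 * tombstone, so the cost of deleting scales with the number of deletions
 * rather than the size of the index. Hits are filtered at search time.
 */
class TombstoneDeletes {
    private final Set<Integer> tombstones = new HashSet<Integer>();

    /** O(1) per delete, independent of segment size. */
    void delete(int docId) { tombstones.add(docId); }

    boolean isDeleted(int docId) { return tombstones.contains(docId); }

    /** Filter a postings walk against the tombstones. */
    int countLiveDocs(int[] postings) {
        int live = 0;
        for (int docId : postings) {
            if (!isDeleted(docId)) live++;
        }
        return live;
    }
}
```

The trade-off is the usual one: deletes get cheaper to write, searches pay a per-hit membership check until the tombstones are merged away.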
Re: Realtime Search
Is there something that I am missing? I see lots of references to using memory mapped files to dramatically improve performance. I don't think this is the case at all. At the lowest levels, it is somewhat more efficient from a CPU standpoint, but with a decent OS cache the IO performance difference is going to be negligible. The primary benefit of memory mapped files is simplicity in code (although in Java there is another layer needed - think C), and the file can be treated as a randomly accessible memory array. From my OS design experience, the page at http://en.wikipedia.org/wiki/Memory-mapped_file is incorrect. Even if the memory mapped file is mapped into the virtual memory space, unless you have specialized memory controllers and disk systems, when a page fault occurs, the OS loads the page just as any other. The difference with direct IO is that there is first a simple translation from position to disk page, and the OS disk page cache is checked. Almost exactly the same thing occurs with a memory mapped file. The memory address is accessed; if not in memory, a page fault occurs, and the page is loaded from the file (it may be loaded from the OS disk cache in this process). The point being, if the page is not in the cache (which is probably the case with a large index), the time to load the page is far greater than the difference between the IO address translation and the memory address lookup. If all of the pages of the index can fit in memory, a properly configured system is going to have them in the page cache anyway. On Dec 23, 2008, at 8:22 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 05:51:43PM -0800, Jason Rutherglen wrote: Are there other implementation options? Here's the plan for Lucy/KS: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. 2) Enable segment-centric sorted search.
(LUCENE-1483) 3) Implement tombstone-based deletions, so that the cost of deleting documents scales with the number of deletions rather than the size of the index. 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Seems doubtful you will be able to do this without increasing the index size dramatically. Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it will make much difference in speed - here are several reasons why... 1. With fixed-size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. 2. With a reopen, only new segments are read, and since it is a new segment, it is most likely already in the disk cache (from the write), so the reopen penalty is negligible (especially if the term index skip-to is written last). 3. If the reopen is after an optimize - when the OS cache has probably been obliterated - then the warm-up time is going to be similar in most cases anyway, since the index pages will also not be in core (in the case of memory-mapped files). Again, writing the skip-to last can help with this. Just because a file is memory mapped does not mean its pages will have a greater likelihood to be in the cache. The locality of reference is going to control this, just as most/often access controls it in the OS disk cache. Also, most OSs will take real memory from the virtual address space and add it to the disk cache if the process is doing lots of IO. If you have a memory-mapped term index, you are still going to need to perform a binary search to find the correct term page, and after an optimize the visited pages will not be in the cache (or in core). On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance.
There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Also, if you are thinking that accessing the buffer directly will be faster than parsing the packed structure, I'm not so sure. You can review the source for the various buffers, and since there is no struct support in Java, you end up combining bytes to make longs, etc. Also, a lot of the accesses are through Unsafe, which is slower than the indirection on a Java object to access a field. My review of these classes makes me think that parsing the skip-to index once into Java objects for later use is going to be a lot faster overall than accessing the entire mapped file directly on every invocation. On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
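The "combining bytes to make longs" point above is concrete: with no struct support, a direct buffer access re-assembles each primitive on every call, while a one-time parse pays that cost once and leaves plain array indexing. A small illustration of the two access patterns (the trade-off claim itself is robert's; this just shows what each pattern looks like).

```java
import java.nio.ByteBuffer;

/**
 * Illustration of the two access patterns discussed above:
 * decode-per-access against a raw buffer versus decoding the whole
 * structure once into plain Java longs.
 */
class BufferVsParsed {
    /** Decode-per-access: bytes are combined into a long on every call. */
    static long entryFromBuffer(ByteBuffer buf, int i) {
        return buf.getLong(i * 8);
    }

    /** Decode-once: unpack the buffer up front; later reads are array indexing. */
    static long[] parseOnce(ByteBuffer buf) {
        long[] entries = new long[buf.remaining() / 8];
        for (int i = 0; i < entries.length; i++) {
            entries[i] = buf.getLong(i * 8);
        }
        return entries;
    }
}
```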
Re: Realtime Search for Social Networks Collaboration
archive based indexes which were used less (yes, the search engine default search was on data no more than 1 month old, though the user could open the time window by including archives). As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see the Lucene JVM ORACLE integration, http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html). I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all. -- Joaquin I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast.
But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture. I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a plan B alternative to MySQL-based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the rows parameter, etc). But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly stale views of the data in the range of minutes. It just made things easier/possible at the time.
But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications. Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get are some of the top priorities going forward. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands
Re: Realtime Search for Social Networks Collaboration
Hi Mike, How do column-stride fields work for StringIndex field caching? I have been working on the tag index, which may be more suitable for field caching and makes range queries faster. It is something that would be good to integrate into core Lucene as well. It may be more suitable for many situations. Perhaps the column stride and tag index can be merged? What is the progress on cs? Reopen then must only materialize any buffered deletes by Term or Query, unless we decide to move up that materialization into the actual delete call, since we will have SegmentReaders open anyway. I think I'm leaning towards that approach... best to pay the cost as you go, instead of an aggregated cost on reopen? I don't follow this part. There is an IndexReader exposed from IndexWriter. I think the individual SegmentReaders should be exposed as well; I don't see any reason not to, and there are many cases where it has been frustrating that SegmentReaders are package protected. I am not sure from what you mentioned how the deletedDocs bitvector is handled. On Fri, Sep 19, 2008 at 8:30 AM, Michael McCandless [EMAIL PROTECTED] wrote: Jason Rutherglen wrote: Mike, The other issue that will occur that I addressed is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently in Ocean realtime search the individual readers are searched using a MultiSearcher in order to search in parallel and reuse the field caches. How will field caching work with the IndexWriter approach? It seems like it would need a dynamically growing field cache array? That is a bit tricky. By doing in-memory merging in Ocean, the field caches last longer and do not require growing arrays.
First off, I think the combination of LUCENE-1231 and LUCENE-831, which should result in a FieldCache that is distributed down to each SegmentReader and much faster to initialize, should make incrementally updating the FieldCache much more efficient (ie, on calling IndexReader.reopen, it should only be the new segments that need to populate their FieldCache). Hopefully these land before real-time search, because then I have more API flexibility to expose column-stride fields on the in-RAM documents. There is still some trickiness, because an ordinary IndexWriter would never hold the column-stride fields in RAM. They'd be flushed to the Directory, immediately per document, just like stored fields and term vectors are today. So, maybe, the first RAMReader you get from the IndexWriter would load back in these fields, triggering IndexWriter to add to them as documents are added (maybe using exponentially growing arrays as the underlying store, or, perhaps, separate array fragments, to prevent synchronization when reading from them), such that subsequent reopens simply resync their max docID. How do you plan to handle rapidly deleting the docs of the disk segments? Can the SegmentReader clone patch be used for this? I was thinking we'd flush new .del files every time a reopen is called, but that could very well be costly. Instead, we can keep the deletes pending in the SegmentReaders we're holding open, and then go back to flushing on IndexWriter's normal schedule. Reopen then must only materialize any buffered deletes by Term or Query, unless we decide to move up that materialization into the actual delete call, since we will have SegmentReaders open anyway. I think I'm leaning towards that approach... best to pay the cost as you go, instead of an aggregated cost on reopen? Mike
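The per-segment FieldCache idea above (only new segments pay the load cost on reopen) can be sketched roughly as follows. This is a hypothetical illustration, not Lucene's actual API: the class and method names are invented, and an int array stands in for values loaded off disk.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Sketch: a field cache keyed per segment, so that after a reopen only
// segments not seen before populate a cache entry; unchanged segments
// reuse the arrays already loaded.
class PerSegmentFieldCache {
    // Weak keys let entries be collected once a segment reader goes away.
    private final Map<Object, int[]> cache = new WeakHashMap<>();
    int loads = 0; // counts how many segments actually had to load

    int[] getInts(Object segmentKey, int maxDoc) {
        int[] values = cache.get(segmentKey);
        if (values == null) {
            values = new int[maxDoc]; // stand-in for reading field values from the segment
            cache.put(segmentKey, values);
            loads++;
        }
        return values;
    }
}
```

On a simulated reopen, asking again for a segment already in the cache returns the same array without reloading; only a newly flushed segment increments the load count.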
Re: Realtime Search for Social Networks Collaboration
Mike, The other issue that will occur that I addressed is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently in Ocean realtime search the individual readers are searched using a MultiSearcher in order to search in parallel and reuse the field caches. How will field caching work with the IndexWriter approach? It seems like it would need a dynamically growing field cache array? That is a bit tricky. By doing in-memory merging in Ocean, the field caches last longer and do not require growing arrays. How do you plan to handle rapidly deleting the docs of the disk segments? Can the SegmentReader clone patch be used for this? Jason On Thu, Sep 11, 2008 at 8:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: Right, there would need to be a snapshot taken of all terms when IndexWriter.getReader() is called. This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the terms so TermEnum can just step through them. (We might be able to delay this sorting until the first time something asks for it). Also, it must merge this data from all threads, since each thread holds its hash per field. I've got a rough start at coding this up... The costs are clearly growing, in order to keep the point-in-time feature of this RAMIndexReader, but I think they are still well contained unless you have a really huge RAM buffer. Flushing is still tricky because we cannot recycle the byte block buffers until all running TermDocs/TermPositions iterations are finished. Alternatively, I may just allocate new byte blocks and allow the old ones to be GC'd on their own once running iterations are finished. Mike Jason Rutherglen wrote: Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved.
Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cut over to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes a SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files.
Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
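The point-in-time RAMReader described in this thread amounts to enumerating the live postings with a cap on docID recorded at open time: documents added after the "open" have higher docIDs and are simply never returned. A minimal sketch of that cap, with invented class names standing in for the real postings structures:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the RAMReader idea: a point-in-time view over live, still-growing
// postings is just the shared list plus a max-docID limit taken when the
// reader was opened. Enumeration stops at the limit, so docs added later
// stay invisible to that reader. Names here are illustrative, not Lucene's.
class LivePostings {
    private final List<Integer> docIDs = new ArrayList<>(); // ascending docIDs for one term

    void addDoc(int docID) { docIDs.add(docID); }

    int maxDocID() { return docIDs.isEmpty() ? -1 : docIDs.get(docIDs.size() - 1); }

    // Enumerate only up to the point-in-time cap recorded at "open" time.
    List<Integer> snapshotDocs(int maxDocIDInclusive) {
        List<Integer> out = new ArrayList<>();
        for (int d : docIDs) {
            if (d > maxDocIDInclusive) break; // beyond the snapshot limit: stop
            out.add(d);
        }
        return out;
    }
}
```

A reader "opened" before a later addDoc sees only the earlier docs, while a fresh cap sees everything, which is the r1/r2 behavior discussed above.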
Re: Realtime Search for Social Networks Collaboration
Right, there would need to be a snapshot taken of all terms when IndexWriter.getReader() is called. This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the terms so TermEnum can just step through them. (We might be able to delay this sorting until the first time something asks for it). Also, it must merge this data from all threads, since each thread holds its hash per field. I've got a rough start at coding this up... The costs are clearly growing, in order to keep the point in time feature of this RAMIndexReader, but I think are still well contained unless you have a really huge RAM buffer. Flushing is still tricky because we cannot recycle the byte block buffers until all running TermDocs/TermPositions iterations are finished. Alternatively, I may just allocate new byte blocks and allow the old ones to be GC'd on their own once running iterations are finished. Mike Jason Rutherglen wrote: Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved. Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialiazed into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. 
Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Realtime Search for Social Networks Collaboration
Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved. Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialiazed into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. 
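The "copied away on reopen" idea can be sketched as follows. This is a hypothetical stand-in for the live DocumentsWriter hashtable, not actual Lucene code: the writer mutates a term -> docFreq map in place, and each reopen takes an immutable copy so a reader's statistics stay stable while indexing continues.

```java
// Hypothetical stand-in for the live DocumentsWriter term hashtable: the
// writer updates docFreq in place; each reopened reader takes a copy so its
// statistics do not shift underneath it.
import java.util.HashMap;
import java.util.Map;

public class DocFreqSnapshot {
    private final Map<String, Integer> live = new HashMap<>();

    /** Called by the writer as it indexes a term into a new document. */
    public void addDoc(String term) {
        live.merge(term, 1, Integer::sum);
    }

    /** The docFreq copy taken when a reader is (re)opened. */
    public Map<String, Integer> reopenSnapshot() {
        return new HashMap<>(live); // this copy is the added cost of reopen
    }
}
```

The copy is exactly the reopen cost Mike mentions: proportional to the number of live terms, paid once per reader view.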
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take a full copy of the term -> termFreq on reopen? I don't see how else to do it (I don't understand your suggestion above). So, this will clearly add to the cost of reopen. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. I think we can simply open IndexInputs on these files. I believe Java does the right thing on windows, such that if we are already writing to the file, it does not prevent another file handle from opening the file for reading. Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Anything that you want to incrementally update and uses an IndexReader as a key. Mostly caches I would think... Solr has user-level (application specific) caches, faceting caches, etc. Ahh ok. We should just open up access and mark this as advanced?
Mike
Re: Realtime Search for Social Networks Collaboration
This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved?
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take a full copy of the term -> termFreq on reopen? I don't see how else to do it (I don't understand your suggestion above). So, this will clearly add to the cost of reopen. One could adjust the freq by iterating over the term's documents... skipTo(localMaxDoc) and count how many are after that, then subtract from the freq. I didn't say it was a *good* idea :-) For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. I think we can simply open IndexInputs on these files. I believe Java does the right thing on windows, such that if we are already writing to the file, it does not prevent another file handle from opening the file for reading. Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. -Yonik
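Yonik's (self-deprecated) alternative would look roughly like this in miniature: instead of copying the freq on reopen, count the term's postings at or past the snapshot boundary and subtract them from the live freq. Names are illustrative and a sorted List of docIDs stands in for a real postings enumeration.

```java
// Sketch of adjusting a live docFreq down to a snapshot: count postings at
// or beyond localMaxDoc (docs added after the snapshot) and subtract.
// AdjustedDocFreq is a made-up name, not a Lucene class.
import java.util.List;

public class AdjustedDocFreq {
    /**
     * @param postings    sorted docIDs for one term, including docs added
     *                    after the snapshot was taken
     * @param liveFreq    the writer's current docFreq for the term
     * @param localMaxDoc the snapshot's maxDoc limit
     */
    public static int docFreqAsOf(List<Integer> postings, int liveFreq, int localMaxDoc) {
        int beyond = 0;
        for (int docID : postings) {
            if (docID >= localMaxDoc) {
                beyond++; // this doc arrived after the snapshot; uncount it
            }
        }
        return liveFreq - beyond;
    }
}
```

The trade-off the thread identifies is visible here: reopen gets cheap, but every docFreq lookup pays a postings walk, which is why the copy-on-reopen approach was preferred.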
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 4:23 PM, Yonik Seeley [EMAIL PROTECTED] wrote: I thought an index reader which supports real-time search no longer maintains a static view of an index? It seems advantageous to just make it really cheap to get a new view of the index (if you do it for every search, it amounts to the same thing, right?) Sounds like these light-weight views of the index are backed up by something dynamic, right? Quite a bit of code in Lucene assumes a static view of the Index I think (even IndexSearcher), and it's nice to have a stable index view for the duration of a single request. Agree. On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. We cannot assume it's always RandomAccessFile, can we? So we may have to flush after writing each document. Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 11:42 AM, Ning Li [EMAIL PROTECTED] wrote: On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. We cannot assume it's always RandomAccessFile, can we? No, it would essentially be a change in the semantics that all implementations would need to support. So we may have to flush after writing each document. Flush when creating a new index view (which could possibly be after every document is added, but doesn't have to be). Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? We could maybe do something like a proxy IndexInput/IndexOutput that would allow updating the read buffer from the writer buffer. -Yonik
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take a full copy of the term -> termFreq on reopen? I don't see how else to do it (I don't understand your suggestion above). So, this will clearly add to the cost of reopen. One could adjust the freq by iterating over the term's documents... skipTo(localMaxDoc) and count how many are after that, then subtract from the freq. I didn't say it was a *good* idea :-) Ahh, OK :) For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. I think we can simply open IndexInputs on these files. I believe Java does the right thing on windows, such that if we are already writing to the file, it does not prevent another file handle from opening the file for reading. Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. All writes to these files are append only, and, when we open the IndexInput we would never read beyond its current length (once we flush our IndexOutput) because that's the local maxDocID limit.
Mike
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: On Tue, Sep 9, 2008 at 11:42 AM, Ning Li [EMAIL PROTECTED] wrote: On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. We cannot assume it's always RandomAccessFile, can we? No, it would essentially be a change in the semantics that all implementations would need to support. Right, which is that you are allowed to open an IndexInput on a file when an IndexOutput has that same file open and is still appending to it. So we may have to flush after writing each document. Flush when creating a new index view (which could possibly be after every document is added, but doesn't have to be). Assuming we can make the above semantics requirement change to IndexInput, we don't need to flush on opening a new RAM reader? Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? We could maybe do something like a proxy IndexInput/IndexOutput that would allow updating the read buffer from the writer buffer. Does HDFS disallow a reader from reading a file that's still open for append? Mike
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 12:41 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. All writes to these files are append only, and, when we open the IndexInput we would never read beyond its current length (once we flush our IndexOutput) because that's the local maxDocID limit. Right, but it would be nice to not have to open a new IndexInput for each snapshot... opening a file is not a quick operation. -Yonik
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 12:45 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: No, it would essentially be a change in the semantics that all implementations would need to support. Right, which is that you are allowed to open an IndexInput on a file when an IndexOutput has that same file open and is still appending to it. Not just that, but that the size can actually grow after the IndexInput has been opened, and that should be visible. That would seem necessary for sharing the IndexInput (via a clone). So we may have to flush after writing each document. Flush when creating a new index view (which could possibly be after every document is added, but doesn't have to be). Assuming we can make the above semantics requirement change to IndexInput, we don't need to flush on opening a new RAM reader? Yes, we would need to flush... I was just pointing out that you don't necessarily need a new RAM reader for every document added (but that is the worst case scenario). -Yonik
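The semantics being debated here can be checked with plain java.io, independent of Lucene's Directory abstraction: a reader opened on a file that a writer still has open can see later appends, provided it re-queries the file length instead of caching it. This is only a behavioral sketch (AppendWhileReading is a made-up name), not a proposed IndexInput change.

```java
// Behavioral check: open a file for reading while a writer still has it
// open and keeps appending; re-query the length on the reader side rather
// than caching it, as discussed for append-only IndexInput semantics.
import java.io.File;
import java.io.RandomAccessFile;

public class AppendWhileReading {
    public static String readAfterAppend() throws Exception {
        File f = File.createTempFile("append-demo", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile writer = new RandomAccessFile(f, "rw");
             RandomAccessFile reader = new RandomAccessFile(f, "r")) {
            writer.writeBytes("abc");
            // The reader was opened before this second append; because
            // RandomAccessFile writes are unbuffered, the bytes are visible.
            writer.writeBytes("def");
            byte[] buf = new byte[(int) reader.length()]; // re-query length
            reader.readFully(buf);
            return new String(buf, "US-ASCII");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAfterAppend()); // prints abcdef
    }
}
```

This matches Mike's observation that both file handles can coexist; Yonik's caveat is that Lucene's IndexInput implementations add their own length caching and buffering on top, which is exactly what would have to change.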
Re: Realtime Search for Social Networks Collaboration
Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? We could maybe do something like a proxy IndexInput/IndexOutput that would allow updating the read buffer from the writer buffer. Does HDFS disallow a reader from reading a file that's still open for append? HDFS allows that. A reader is guaranteed to be able to read data that was 'flushed' before the reader opened the file. However, it may not see the latest appends (after open) even if they are flushed. Yonik's comments below also apply in this case. Right, but it would be nice to not have to open a new IndexInput for each snapshot... opening a file is not a quick operation. Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
Hi Joaquin, Using HBase with realtime Lucene would be in line with what Google does. However the question is whether or not this is completely necessary or the most simple approach. That probably can only be answered by doing a live comparison of the two! Unfortunately that would require probably quite a bit of work and resources. For now, Ocean stores the data in the Lucene indexes because it works, it's easy to implement etc. I have looked at other options, however they need to be prioritized in terms of need vs cost. I would put the HBase solution possibly at the high end of the resource scale. I think usually it's best to keep things as simple as possible and as cheap as possible. More complexity in a scalable realtime search solution would mean more people, more expertise, and more possibilities for breakage. It would need to be clear what HBase or other solutions for storing the data brought to the table, which because I don't have time to look at them, I cannot answer. Nonetheless it is somewhat interesting. Cheers, Jason Rutherglen On Sun, Sep 7, 2008 at 11:16 AM, J. Delgado [EMAIL PROTECTED] wrote: On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Hey, maybe the right way to go for a truly scalable and high performance semi-structured database is to marry HBase (Big-table like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer.
Let me quote this from the original Bigtable paper from Google: Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Re: Realtime Search for Social Networks Collaboration
Hi, We experimented using HBase's scalable infrastructure to scale out Lucene: http://www.mail-archive.com/[EMAIL PROTECTED]/msg01143.html There is the concern on the impact of HDFS's random read performance on Lucene search performance. And we can discuss if HBase's architecture is best for scale-out Lucene. But to me, the general idea of reusing a scalable infrastructure (if a suitable one exists) is appealing - such an infrastructure already handles repartitioning for scalability, fault tolerance etc. I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close... Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
Ning Li wrote: I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close... Maybe we should start fleshing out what we want in realtime search on the wiki? Could it be as simple as making InstantiatedIndex realtime (allow writes/reads at the same time?). Then you could search over your IndexReader as well as the InstantiatedIndex. Writes go to both the Writer and the InstantiatedIndex. Nothing is actually permanent until the true commit, but stuff is visible pretty fast... a new IndexReader view starts a fresh InstantiatedIndex... Jason's realtime patch is still pretty large... would be nice if we could accomplish this with as few changes as possible...
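Mark's dual-write scheme reduces to a small model. The sketch below uses hypothetical names, with Lists standing in for IndexWriter and InstantiatedIndex and substring matching standing in for real search: adds go to both the durable writer and an in-memory index, search spans the committed view plus the in-memory view, and commit promotes the writer's buffer to the committed view and starts a fresh in-memory index.

```java
// Toy model of the dual-write idea: documents are searchable immediately via
// the in-memory side, and become part of the committed reader view at commit.
// All names are illustrative stand-ins, not Lucene/InstantiatedIndex API.
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class DualWriteIndex {
    private final List<String> writerBuffer = new ArrayList<>(); // IndexWriter stand-in
    private List<String> committedView = new ArrayList<>();      // IndexReader stand-in
    private final List<String> inMemory = new ArrayList<>();     // InstantiatedIndex stand-in

    public void addDocument(String doc) {
        writerBuffer.add(doc); // durable, but not searchable until commit
        inMemory.add(doc);     // searchable right away
    }

    public void commit() {
        committedView = new ArrayList<>(writerBuffer); // new reader view
        inMemory.clear();                              // fresh in-memory index
    }

    /** Counts matching docs across the committed view and the in-memory index. */
    public long search(String term) {
        return Stream.concat(committedView.stream(), inMemory.stream())
                     .filter(d -> d.contains(term))
                     .count();
    }
}
```

Because commit replaces the committed view and clears the in-memory side in one step, no document is ever counted twice, which is the property that makes the scheme workable.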
Re: Realtime Search for Social Networks Collaboration
InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptable for a Lucene core version of realtime. Perhaps this basic feature set is good enough.
Re: Realtime Search for Social Networks Collaboration
I'd also like to make time to explore the approach of creating an IndexReader impl. that searches IndexWriter's RAM buffer. I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Mike
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless [EMAIL PROTECTED] wrote: I'd also like to make time to explore the approach of creating an IndexReader impl. that searches IndexWriter's RAM buffer. That seems like it could possibly be the best performing approach in the long run. I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. -Yonik
Re: Realtime Search for Social Networks Collaboration
I need to point out that the only thing I know InstantiatedIndex to be great at is read access in the inverted index. It consumes a lot more heap than RAMDirectory and InstantiatedIndexWriter is slightly less efficient than IndexWriter. Please let me know if your experience differs from the above statement. On 8 Sep 2008, at 16:36, Jason Rutherglen wrote: InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptable for a Lucene core version of realtime. Perhaps this basic feature set is good enough.
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update?
Mike
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley [EMAIL PROTECTED] wrote: But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? I thought an index reader which supports real-time search no longer maintains a static view of an index? Similar to InstantiatedIndexReader, it will be in sync with an index writer. IndexReader r = indexWriter.getIndexReader(); getIndexReader() (i.e. get real-time index reader) returns the same reader instance for a writer instance. On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless [EMAIL PROTECTED] wrote: Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Now this won't be a problem any more. Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 3:56 PM, Ning Li [EMAIL PROTECTED] wrote: On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley [EMAIL PROTECTED] wrote: But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? I thought an index reader which supports real-time search no longer maintains a static view of an index? It seems advantageous to just make it really cheap to get a new view of the index (if you do it for every search, it amounts to the same thing, right?) Quite a bit of code in Lucene assumes a static view of the index I think (even IndexSearcher), and it's nice to have a stable index view for the duration of a single request. Similar to InstantiatedIndexReader, it will be in sync with an index writer. Right... that's why I was clarifying. You can still make stable views of the index with multiple InstantiatedIndex instances, but it doesn't seem as efficient. -Yonik
Re: Realtime Search for Social Networks Collaboration
That sounds about correct and I don't think it matters much. I cap the number of documents stored in InstantiatedIndex at 100 by default, so the heap size doesn't become a problem. On Mon, Sep 8, 2008 at 2:58 PM, Karl Wettin [EMAIL PROTECTED] wrote: I need to point out that the only thing I know InstantiatedIndex to be great at is read access in the inverted index. It consumes a lot more heap than RAMDirectory and InstantiatedIndexWriter is slightly less efficient than IndexWriter. Please let me know if your experience differs from the above statement. On 8 Sep 2008, at 16:36, Jason Rutherglen wrote: InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptable for a Lucene core version of realtime. Perhaps this basic feature set is good enough. On Mon, Sep 8, 2008 at 10:23 AM, Mark Miller [EMAIL PROTECTED] wrote: Ning Li wrote: I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close. Maybe we should start fleshing out what we want in realtime search on the wiki? Could it be as simple as making InstantiatedIndex realtime (allow writes/reads at the same time)? Then you could search over your IndexReader as well as the InstantiatedIndex. Writes go to both the Writer and the InstantiatedIndex. Nothing is actually permanent until the true commit, but stuff is visible pretty fast... a new IndexReader view starts a fresh InstantiatedIndex... Jason's realtime patch is still pretty large... would be nice if we could accomplish this with as few changes as possible...
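Mark Miller's dual-write idea above can be sketched as a plain-Java stand-in: each added document goes both to the durable writer path and to a small in-memory index that is searchable at once, and opening a fresh reader view absorbs the writes and starts a new in-memory index. Class and method names here are illustrative, not Lucene or InstantiatedIndex APIs:

```java
import java.util.ArrayList;
import java.util.List;

class DualWriteIndex {
    private final List<String> writer = new ArrayList<>();   // stands in for IndexWriter (durable path)
    private List<String> readerView = new ArrayList<>();     // stands in for the last opened IndexReader
    private List<String> ramIndex = new ArrayList<>();       // stands in for InstantiatedIndex

    void addDocument(String doc) {
        writer.add(doc);    // goes to the Writer (visible only after reopen)
        ramIndex.add(doc);  // and to the in-memory index (visible immediately)
    }

    /** Search covers both the committed reader view and fresh RAM additions. */
    boolean isSearchable(String doc) {
        return readerView.contains(doc) || ramIndex.contains(doc);
    }

    /** A new reader view picks up all writes and starts a fresh RAM index. */
    void reopen() {
        readerView = new ArrayList<>(writer);
        ramIndex = new ArrayList<>();
    }
}
```

Capping the RAM side (as Karl does at 100 documents) bounds the heap cost of the immediately-visible portion between reopens.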
Re: Realtime Search for Social Networks Collaboration
Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult.
Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Mike
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Anything that you want to incrementally update and uses an IndexReader as a key. Mostly caches I would think... Solr has user-level (application specific) caches, faceting caches, etc. -Yonik
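Yonik's point about caches keyed by IndexReader is why exposing sub-readers matters: if a cache is keyed per sub-reader (segment) rather than per top-level reader, a reopen that keeps most segments also keeps most cache entries. A minimal sketch of that keying, with an illustrative `SegmentCache` class (not a Solr or Lucene API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class SegmentCache<V> {
    // Keyed by the sub-reader's identity, not by the enclosing MultiReader,
    // so entries survive a reopen that reuses unchanged segments.
    private final Map<Object, V> bySegment = new HashMap<>();

    V get(Object segmentReader, Function<Object, V> compute) {
        return bySegment.computeIfAbsent(segmentReader, compute);
    }

    int size() { return bySegment.size(); }
}
```

After a reopen, only the new segments' keys miss the cache; unchanged segments hit their existing entries with no recomputation.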
Re: Realtime Search for Social Networks Collaboration
have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture. I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a plan B alternative to MySQL based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the rows parameter, etc). But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly stale views of the data in the range of minutes. It just made things easier/possible at the time. But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications. Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get are some of the top priorities going forward. -Yonik
Re: Realtime Search for Social Networks Collaboration
-- Marcelo F. Ochoa http://marceloochoa.blogspot.com/ http://marcelo.ochoa.googlepages.com/home __ Do you Know DBPrism? Look @ DB Prism's Web Site http://www.dbprism.com.ar/index.html More info? Chapter 17 of the book Programming the Oracle Database using
Re: Realtime Search for Social Networks Collaboration
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. Otis, what do you mean exactly by adding real-time search to Lucene? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition real-time: once you add/write a document to the index it becomes immediately searchable, and if a document is logically deleted it is no longer returned in a search, though physical deletion happens during an index optimization. Now, the problem of adding/deleting documents in bulk, as part of a transaction and making these documents available for search immediately after the transaction is committed, sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non real-time. For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, having a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services which powered a broker-based distributed search interface. The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk add transaction, and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs. We even had further partitioning of the indexes that reflected time periods, with caps on size for them to be merged into older, more archive-based indexes which were used less (yes, the search engine default search was on data no more than 1 month old, though the user could open the time window by including archives).
As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see Lucene JVM ORACLE integration, see http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html ) I think we should seriously look at joining efforts with open-source Database engine projects, written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM for once and for all. -- Joaquin I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on once we get there.
I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why
Re: Realtime Search for Social Networks Collaboration
Interesting discussion. I think we should seriously look at joining efforts with open-source Database engine projects I posted some initial dabblings here with a couple of the databases on your list: http://markmail.org/message/3bu5klzzc5i6uhl7 but this is not really a scalable solution (which is what Jason and others need) for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Cheers, Mark.
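The trade Mark describes, giving up joins in exchange for scale-out, usually means denormalizing at index time: the "join" is performed once, when documents are built, so queries never need one. A toy sketch of that flattening step, with invented field names and plain maps standing in for Lucene documents:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Index-time denormalization: each order row is flattened together with
 * its customer row into a single self-contained document, so the search
 * system never has to join at query time.
 */
class Denormalizer {
    /** customers maps customerId -> customerName; each order is {orderId, customerId, item}. */
    static List<Map<String, String>> flatten(Map<String, String> customers,
                                             List<String[]> orders) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String[] order : orders) {
            Map<String, String> doc = new HashMap<>();
            doc.put("orderId", order[0]);
            doc.put("item", order[2]);
            // the "join", done once at index time rather than per query
            doc.put("customerName", customers.get(order[1]));
            docs.add(doc);
        }
        return docs;
    }
}
```

The cost is duplicated data and re-indexing when the parent record changes, which is exactly the weakness-and-strength trade-off Mark mentions.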
Re: Realtime Search for Social Networks Collaboration
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Hey, maybe the right way to go for a truly scalable and high-performance semi-structured database is to marry HBase (Bigtable-like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer combination. Let me quote this from the original Bigtable paper from Google: Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Re: Realtime Search for Social Networks Collaboration
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene implementation), the three minimal features a transactional DB should support for Lucene integration are: 1) The ability to define new functions (e.g. lcontains(), lscore()) which would allow binding queries to Lucene and obtaining document/row scores 2) An API that would allow DML intercepts, like Oracle's ODCI. 3) The ability to extend and/or implement new types of domain indexes that the engine's query evaluation and execution/optimization planner can use efficiently. Thanks Marcelo. -- Joaquin On Sun, Sep 7, 2008 at 8:16 AM, J. Delgado [EMAIL PROTECTED] wrote: On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Hey, maybe the right way to go for a truly scalable and high-performance semi-structured database is to marry HBase (Bigtable-like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer combination. Let me quote this from the original Bigtable paper from Google: Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.
Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Re: Realtime Search for Social Networks Collaboration
Hi, - Original Message From: J. Delgado [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Sunday, September 7, 2008 4:04:58 AM Subject: Re: Realtime Search for Social Networks Collaboration On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. Otis, what do you mean exactly by adding real-time search to Lucene? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition real-time: once you add/write a document to the index it becomes immediately searchable, and if a document is deleted it is logically deleted and no longer returned in searches, though physical deletion happens during an index optimization. OG: When I think about real-time search I see it as: Make the newly added document show up in search results without closing and reopening the whole index with IndexWriter. In other words, minimize re-reading of the old/unchanged data just to be able to see the newly added data. I believe this is similar to what IndexReader.reopen does and Jason does make use of it. Otis Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making these documents available for search immediately after the transaction is committed sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non-real-time. For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, with a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services, which powered a broker-based distributed search interface.
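Otis's definition of real-time search, making new documents visible while re-reading only the changed portion of the index (what IndexReader.reopen does), can be modeled with a toy segment-snapshot reader. This is a plain-Java illustration of the semantics, not the real IndexReader API; all names are invented:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy index: each flush appends one immutable segment. */
class ToyIndex {
    final List<List<String>> segments = new ArrayList<>();

    void flushSegment(List<String> docs) { segments.add(new ArrayList<>(docs)); }

    ToyReader open() { return new ToyReader(this, segments.size()); }
}

/**
 * A reader is a snapshot of the segments that existed when it was opened.
 * reopen() reuses the already-visible segments and only picks up newly
 * flushed ones -- the "minimize re-reading of unchanged data" idea.
 */
class ToyReader {
    private final ToyIndex index;
    private final int segmentCount;  // segments visible to this snapshot

    ToyReader(ToyIndex index, int segmentCount) {
        this.index = index;
        this.segmentCount = segmentCount;
    }

    /** Cheap when nothing changed: returns this same reader. */
    ToyReader reopen() {
        return index.segments.size() == segmentCount
                ? this
                : new ToyReader(index, index.segments.size());
    }

    int numDocs() {
        int n = 0;
        for (int i = 0; i < segmentCount; i++) n += index.segments.get(i).size();
        return n;
    }
}
```

In real Lucene, reopen additionally shares the underlying per-segment readers between the old and new reader, which is where the efficiency win comes from.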
The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk add transaction, and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs. We even had further partitioning of the indexes that reflected time periods, with caps on size, for them to be merged into older, more archive-based indexes which were used less (yes, the search engine's default search was on data no more than 1 month old, though users could open the time window by including archives). As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts to efficiently integrate Lucene into relational databases (see the Lucene JVM ORACLE integration: http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html) I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all. -- Joaquin I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion?
If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only
Re: Realtime Search for Social Networks Collaboration
requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture. I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a plan B alternative to MySQL-based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the rows parameter, etc). But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly stale views of the data, in the range of minutes. It just made things easier/possible at the time. But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications.
Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get, are some of the top priorities going forward. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Realtime Search for Social Networks Collaboration
There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here. It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik
Re: Realtime Search for Social Networks Collaboration
Hi Yonik, I fully agree with good for projects in the long term. I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects; that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble would like to build a SOLR-specific realtime search. I think that's a good idea that I may be able to offer some help on. Realtime is relative anyways; for many projects database-like updates are probably not necessary, neither is replication, or perhaps even 100% uptime and scalability. I just want the features, and if someone would like to work with me to get them into the core Lucene and SOLR projects that would be cool. If not, at least the code is out there to get ideas from. These discussions are a good starting point. Cheers, Jason On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley [EMAIL PROTECTED] wrote: There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here. It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik
Re: Realtime Search for Social Networks Collaboration
Hi Jason, I think this is a misunderstanding. I only want to add these features incrementally so that users can use them as soon as possible, rather than delay them to a later release by re-architecting (which may take more time and shift our focus from our users). The features are more important than the code, but it will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users, and a collaboration will be good for the community in the long term. On Sat, Sep 6, 2008 at 9:11 PM, Jason Rutherglen [EMAIL PROTECTED] wrote: Hi Yonik, I fully agree with good for projects in the long term. I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects; that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble would like to build a SOLR-specific realtime search. I think that's a good idea that I may be able to offer some help on. Realtime is relative anyways; for many projects database-like updates are probably not necessary, neither is replication, or perhaps even 100% uptime and scalability. I just want the features, and if someone would like to work with me to get them into the core Lucene and SOLR projects that would be cool. If not, at least the code is out there to get ideas from. These discussions are a good starting point. Cheers, Jason On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley [EMAIL PROTECTED] wrote: There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view).
Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here. It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik -- Regards, Shalin Shekhar Mangar.
Re: Realtime Search for Social Networks Collaboration
On Sep 6, 2008, at 4:36 AM, Otis Gospodnetic wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? Yeah, I agree. There's a place for RT search in Lucene, but it seems to me we have a pretty good search server in Solr that needs some things going forward that are reasonable to work on there.
It makes sense to me not to duplicate efforts on all of those fronts and have two projects/communities that share 80-90% of their functionality (either existing, or planned). As Yonik says, it may take longer than just doing it by oneself, but in the long run, the outcome is usually better. My two cents, Grant
Re: Realtime Search for Social Networks Collaboration
On Saturday 06 September 2008 18:53:39, Shalin Shekhar Mangar wrote: ... The features are more important than the code but it will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users and a collaboration will be good for the community in the long term. Some experience from larger patches: - stepwise is good, - so plan for steps, in which - each step is an improvement on its own. Then: - try to keep the first step as small as possible, - with some luck, someone else will improve the first step, - learn from the improvement, - repeat, and never hurry. Some comments on the current patch at LUCENE-1313: - Copyright is assigned to individual authors; better to assign that to ASF. - Individual authors are mentioned in the code; that's not Lucene policy at the moment. - Some files do not contain an ASF licence; not a real problem. - The directory structure could also be in contrib/ocean as top directory. - There is a whole package of logging in there, but there's no logging in Lucene at the moment. - There is at least one empty class, SearcherPolicy. - Unseen so far: - the second half of the patch, - the Java code within the class {...} statements (sorry.) Even though the patch is down to 25% of its first size, it's still 474 kb, which is large by any standard. So the question is: is there a first step to be taken from this patch that would be an improvement on its own? Regards, Paul Elschot
Re: Realtime Search for Social Networks Collaboration
Hello Shalin, When I tried to integrate before, it seemed fairly simple. However, the Ocean core code wasn't quite up to par yet, so that needed work. It will help to work with SOLR people directly who can figure out how they want to integrate, such as yourself. Right now I'm finishing up the OceanDatabase portion (sorry for all the Ocean names and things; these can be changed, doesn't matter, but it should be something we agree on). The methods on TransactionSystem are like IndexWriter's. The update method for OceanDatabase is perform(Action action). There are 3 actions: Insert, Update, Delete. To execute queries, the whole thing is abstracted out as a Task. The method is Object run(Task task), where the task gets a reference to the TransactionSystem. I implemented a MultiThreadSearchTask that, as the name suggests, executes a query in multiple threads over the latest Snapshot. The reason for the Task abstraction is to give the client complete access to the server via a potentially dynamically loaded subclass of Task. OceanDatabase should be the main class for most uses of the realtime system because it implements optimistic concurrency. I prefer the simplicity of the main entry point into the search server being only two methods, with the run method offering unlimited functionality without recompiling, building, and deploying the server for each new piece of functionality required. Regards, Jason On Sat, Sep 6, 2008 at 12:53 PM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote: Hi Jason, I think this is a misunderstanding. I only want to add these features incrementally so that users can use them as soon as possible, rather than delay them to a later release by re-architecting (which may take more time and shift our focus from our users). The features are more important than the code, but it will of course help a lot too.
I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users, and a collaboration will be good for the community in the long term. On Sat, Sep 6, 2008 at 9:11 PM, Jason Rutherglen [EMAIL PROTECTED] wrote: Hi Yonik, I fully agree with good for projects in the long term. I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects; that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble would like to build a SOLR-specific realtime search. I think that's a good idea that I may be able to offer some help on. Realtime is relative anyways; for many projects database-like updates are probably not necessary, neither is replication, or perhaps even 100% uptime and scalability. I just want the features, and if someone would like to work with me to get them into the core Lucene and SOLR projects that would be cool. If not, at least the code is out there to get ideas from. These discussions are a good starting point. Cheers, Jason On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley [EMAIL PROTECTED] wrote: There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here.
It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik -- Regards, Shalin Shekhar Mangar.
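The two-method entry point Jason describes, perform(Action) for updates and run(Task) for arbitrary queries, can be roughly reconstructed as a toy sketch. All signatures here are guesses from the prose above, not Ocean's actual code, and a plain list stands in for the index:

```java
import java.util.ArrayList;
import java.util.List;

/** An update action; Insert/Update/Delete would each implement this. */
interface Action { void apply(List<String> docs); }

class Insert implements Action {
    final String doc;
    Insert(String doc) { this.doc = doc; }
    public void apply(List<String> docs) { docs.add(doc); }
}

class Delete implements Action {
    final String doc;
    Delete(String doc) { this.doc = doc; }
    public void apply(List<String> docs) { docs.remove(doc); }
}

/** Client-supplied code run against the latest snapshot; can do anything. */
interface Task { Object run(List<String> snapshot); }

/** Toy stand-in for OceanDatabase: the whole public surface is two methods. */
class ToyDatabase {
    private final List<String> docs = new ArrayList<>();

    public void perform(Action action) { action.apply(docs); }

    public Object run(Task task) {
        // hand the task a snapshot, so queries see a stable view
        return task.run(new ArrayList<>(docs));
    }
}
```

The appeal of the design is visible even in the toy: new query functionality is just a new Task implementation, with no change to the server's API.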
Re: Realtime Search for Social Networks Collaboration
Hi Grant, I think the way to integrate with SOLR and Lucene is for people who are committers to the respective projects to work with me (if they want) on the integration, which will make it fairly straightforward, as it was designed and intended to be. Cheers, Jason On Sat, Sep 6, 2008 at 3:16 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: On Sep 6, 2008, at 4:36 AM, Otis Gospodnetic wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think).
Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? Yeah, I agree. There's a place for RT search in Lucene, but it seems to me we have a pretty good search server in Solr that needs some things going forward that are reasonable to work on there. It makes sense to me not to duplicate efforts on all of those fronts and have two projects/communities that share 80-90% of their functionality (either existing, or planned). As Yonik says, it may take longer than just doing it by oneself, but in the long run, the outcome is usually better. My two cents, Grant
Re: Realtime Search for Social Networks Collaboration
Hi Paul, It's unfortunate the code is larger than most contribs. The libraries can be factored out. The next patch includes OceanDatabase. The Ocean package and class names can be removed in favor of realtime? - There is a whole package of logging in there, but there's no logging in Lucene at the moment. Can be removed, in favor of the IndexWriter-style logging? Is this really the best way to go? It makes debugging more painful, with no automatic method and class insertion in the log entries. I can do it, just thinking of other folks who work on it. The locking and such uses JDK 1.5; I can downgrade it, but for such locking, and with 3.0 possibly coming out soon, is that best? SearcherPolicy: it's a marker class like MergePolicy or Serializable. - Individual authors are mentioned in the code; that's not Lucene policy at the moment. Agreed, Eclipse throws them in, I delete them, maybe some made it in. Maybe the @author should be removed from FieldCacheImpl, FieldDoc, and FieldCache. On Sat, Sep 6, 2008 at 3:41 PM, Paul Elschot [EMAIL PROTECTED] wrote: On Saturday 06 September 2008 18:53:39, Shalin Shekhar Mangar wrote: ... The features are more important than the code but it will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users and a collaboration will be good for the community in the long term. Some experience from larger patches: - stepwise is good, - so plan for steps, in which - each step is an improvement on its own.
Then: - try to keep the first step as small as possible, - with some luck, someone else will improve the first step, - learn from the improvement, - repeat, and never hurry. Some comments on the current patch at LUCENE-1313: - Copyright is assigned to individual authors; better to assign that to ASF. - Individual authors are mentioned in the code; that's not Lucene policy at the moment. - Some files do not contain an ASF licence; not a real problem. - The directory structure could also be in contrib/ocean as top directory. - There is a whole package of logging in there, but there's no logging in Lucene at the moment. - There is at least one empty class, SearcherPolicy. - Unseen so far: - the second half of the patch, - the Java code within the class {...} statements (sorry.) Even though the patch is down to 25% of its first size, it's still 474 kb, which is large by any standard. So the question is: is there a first step to be taken from this patch that would be an improvement on its own? Regards, Paul Elschot