Explore other in-memory postinglist formats for realtime search
---
Key: LUCENE-2346
URL: https://issues.apache.org/jira/browse/LUCENE-2346
Project: Lucene - Java
Issue Type
dows XP it was
definitely faster before (I saw XP perform lots of disk access
when creating small files) so perhaps there's something else
going on.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LU
r that is much slower.
I haven't ruled out having done something really stupid in my test, in which
case I apologise in advance!
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
>
oing down?
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Jav
ucene 3.0, with most
of the time lost in searching. I'd expected the RAMDirectory implementation to
be faster - is what I'm seeing against your expectations too?
> Near Realtime Search (using a built in RAMDirectory)
>
>
>
.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
tch to my checked-out version of trunk
(revision 895585) but it appears that the PrefixSwitchDirectory class is
missing - is there another patch that is needed to get this working?
> Near Realtime Search (using a built in
used, the first
half are primary dir segments, the second, ram dir segments.
The above mentioned changes of course break many unit tests. I'm
going through and evaluating what to do on a case-by-case basis,
and am open to suggestions.
> Near Realtime Search (using a built in RAMD
. Some formatting has been
cleaned up, javadocs added.
I ran TestNRTReaderWithThreads2 a couple times for kicks and didn't see the
assert sr.hasChanges error. I'll probably focus on adding more stress testing.
> Near Realtime Search (using a built in
read for NRT.flush, however, I've also been
debugging this assert !sr.hasChanges issue, which out of 7000 iterations,
occurs once, and is fairly minor. Hmm... Apply deletes shouldn't really
conflict so I'm hoping this isn't an original bug unrelated to LUCENE-1313.
> Ne
s in readerPool.release, and it fails
sometimes. I'm not quite sure why yet.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org
next feature is to have NRT.flush execute in a single background
thread rather than block update doc calls.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
>
k at this soon Jason! Sounds like good progress...
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
>
's occurring.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
being set to false by
the primary writer, so the ram writer wasn't removing the infos from
mergingSegments in mergeFinish.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
>
rrors may crop up. Once it runs successfully for
let's say 30 minutes, we can beef up the stress testing of this
patch by doing concurrent updates, deletes, etc.
> Near Realtime Search (using a built in RAMDirectory)
>
>
>
is is important because deletes may be coming in as we're
merging. However I'm not sure this will work without a shared
lock between the writers for commitMergedDeletes which requires
syncing.
Mike, can you take a look to see if this path will work?
> Near Realtime Search (using a bui
and made the flush method non-synchronized
* There's a subtle synchronization bug causing files to not be found in the
testRandomThreads method
* There's excessive merge logging to debug the sync issue
> Near Realtime Search (using a built in
ushed, and the ram segments are synchronously merged
to the primary writer using a mechanism similar to
addIndexesNoOptimize.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
>
leak, I'm trying to get a copy of an open
source Yourkit license for profiling the heap usage. The code is unfortunately
quite large so whatever it is, is probably easy to fix and hard to find.
> Near Realtime Search (using a built in
[
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Rutherglen updated LUCENE-1313:
-
Attachment: LUCENE-1313.patch
Ah, missing file now included.
> Near Realtime Sea
, so there's some variation.
> Near Realtime Search (using a built in RAMDirectory)
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
>
[
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Rutherglen updated LUCENE-1313:
-
Summary: Near Realtime Search (using a built in RAMDirectory) (was: Near
Realtime
pass
on 2 runs. I didn't make any changes though so am suspicious!
> Near Realtime Search
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
sureContiguousMerge is to keep
docStoreSegments together, this will work as ramDir and
primaryDir docStores should not need to be adjacent (I think,
and need to verify).
> Near Realtime Search
>
>
> Key: LUCENE-1313
> URL: http
idea would require quite a bit
of work. Perhaps OneMerge can have the segmentInfos (ramDir or
primaryDir) they were selected from and the ensureContiguous can
verify that? Then we'd adjust commitMerge to remove the newly
merged segments individually.
I'll give this a try.
> Nea
n IW seems to be problematic as
we'll always potentially have different dir non-contiguous
infos. I'm seeing the error off and on in different test cases.
I will put together a patch separating the two dir infos in IW.
> Near Realtime Search
>
>
>
,
testAddIndexesWithCloseNoWait fails, which I don't think
happened before. testAddIndexOnDiskFull fails when
autoCommit=true which I'm not sure is a valid test by the time
this patch goes in but it probably needs to be looked into.
The other previous notes are still valid.
> Near Re
ed to go through and mark the tests that can be converted to
be NRT specific.
> Near Realtime Search
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
run only if flushToRAM=false.
That seems good?
> Near Realtime Search
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
>
merge all of
them, in order, that merge should be contiguous?
> Near Realtime Search
>
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
ng a ensureContiguousMerge
exception. I think this highlights that the change to merging
all ram segments to a single primaryDir segment can sometimes
lead to choosing segments that are non-contiguous? I'm not sure
of the best way to handle this.
> Near Rea
ditionalize them to run only if flushToRAM=false.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
>
[
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Rutherglen updated LUCENE-1313:
-
Summary: Near Realtime Search (was: Realtime Search)
> Near Realtime Sea
It's very much in progress, but 1) the iterations are
slow (it's a big patch), 2) it's a biggish change so I'd prefer to do it shortly
after a release, not shortly before, so it has plenty of time to "bake" on
trunk.
> Realtime Search
> ---
the impression this was a likely 3.1 ...
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Co
ead and write documents to and from the
docstore. It seemed to work on Windows.
* I think there's more that can be done to more accurately
manage the RAM however I think the way it works is a good
starting point.
> Realtime Search
> ---
>
>
e need to test flushToRAM with a custom
IndexDeletionPolicy which could be a bit tricky.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
olicy to select non-contiguous merges.
But: with autoCommit=false, in order to avoid merging the doc stores, the
segments (even RAM segments) must be contiguous. This is a sizable performance
gain when building a large index in one IndexWriter session.
> Realtime Search
> ---
>
>
ther when merging?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Components: I
? Or only
wait for the merges to complete that are from ramDir to
primaryDir?
{quote}
I think only ramDir -> primaryDir? commit() today doesn't block on BG merges.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issu
mmit is called, do we still want
to not have concurrent merges execute synchronously? Or only
wait for the merges to complete that are from ramDir to
primaryDir?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org
ry file
descriptors.
{quote}let's take this up under a new issue?{quote}
The issue would only be the unit test for now, or should it be a
part of an existing issue? Ok, it will clean up LUCENE-1313's
unit test class.
> Realtime Search
> ---
>
>
y copy-on-write) I think are higher priority.
bq. Doesn't FSDir open only one FD per file?
No, it opens a real file every time openInput is called. I guess we could think
about having it share/clone internally?
> Realtime Search
> ---
>
> Key:
ermvector files merged on disk for
every segment?
{quote}We also should [separately] consider having multiple
SegmentReaders that share the same docStores{quote}
Doesn't FSDir open only one FD per file?
> Realtime Search
> ---
>
> Key: LUCENE-
,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch,
> lucene-1313.patch
>
>
> Enable near realtime search in Lucene without external
> dependencies. When RAM NRT is enabled,
to the
file for the next rambuf segment?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
>
bout concurrency with the docstores
That's a big (and good) change; I think we should save that one for another
issue, and leave this one focusing on flushing segments through a RAMDir?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: h
, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch,
> lucene-1313.patch
>
>
> Enable near realtime search in Lucene without external
> dependencies. When RAM NRT is enabled, the implementation adds a
>
ly less than half of the ram used) when flushToRam=true
so that we can get a version of this functionality out the door,
then iterate as we gather feedback from users?
I'll include the comments in the next patch.
> Realtime Search
> ---
>
> Key: LUCEN
r() turnaround. IE we
can't make getReader() do that flush synchronously. So that needs to
be a BG merge, but we must somehow temporarily disregard the size of
those segments while the merge is running. Or, perhaps we "merge RAM
segments to disk" a bit early, eg once RAM con
[
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Rutherglen updated LUCENE-1313:
-
Description:
Enable near realtime search in Lucene without external
dependencies. When
for
space, the user will presumably want to customize this.
* I'm not sure the flushing always occurs when it should, and
not sure yet how to test to ensure it's flushing when it should
(other than watching a log). What happened to the adding logging
to Lucene patch?
> Re
ested and perhaps prioritizing requests?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Compon
s on. It
utilizes the regular merge policy and the ram merge policy.
* The ram dir size is pushed to DocumentsWriter
* RAMMergePolicy extends LogDocMergePolicy and defaults the
useCompoundFile and useCompoundDocStore to false
* Sorry for the whitespace stuff, I'll clean it up later, I
wanted
Yes.
{quote}we should fix the indexing chain to always use
SegmentWriteState's Directory and not pass Directory to the
ctors{quote}
Yep.
The next patch will have these features.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https:
AM is free in the budget.
bq. We should be able to rely on the directory in SegmentWriteState?
I think we should fix the indexing chain to always use SegmentWriteState's
Directory and *not* pass Directory to the ctors? Does something go wrong if we
take that
w segment will be written to disk. For this reason we
can't simply pass a directory into the constructor of
DocumentsWriter, nor can we rely on calling
IW.getFlushDirectory. We should be able to rely on the directory
in SegmentWriteState?
> Realtime Search
> ---
>
>
g to the meta merge policy which
would clean up IW from managing ram segs vs. prim segs.
Does IW.optimize and IW.expungeDeletes operate on the ramdir as
well (the expungeDeletes javadoc implies calling
IR.numDeletedDocs will return zero when there are no deletes).
> Realtime
rarely -- this shouldn't lead to OOMs. One can always subclass CMS if
this is somehow a problem. Or we could modify CMS to pool its threads
(as a new issue)?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apa
, maybe we can borrow some code from
an Apache project that's implemented a threadpool.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
ver how much of the RAM buffer it's going to give to DW,
too. At first the policy should not change the non-NRT case (ie one
always flushes straight to disk). We can play w/ that in a separate
issue. Need to think more about the logic...
> Realtime Search
> -
hey only make up 20% of the total size of the
ram segments? If we merge the 20% to disk it seems inefficient?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
>
ing this system?). It seems like a variation on
the LogByteSizeMergePolicy however it's unclear whether
LogDocMergePolicy or LogByteSizeMergePolicy ram merges will perform
better (does it matter since it's all in ram and we're capping the
total?)
> Realtime Search
>
rgePolicy.OneMerge.segString no longer needs to take a
Directory (because it now stores a Directory).{quote}
Yeah, I noticed this, I'll change it. MergeSpecification.segString is
public and takes a directory that is not required. What to do?
> Realtime Search
> ---
cause it now stores a Directory).
* The mergeRAMSegmentsToDisk shouldn't be fully synchronized, eg
when doWait is true it should release the lock while merges are
taking place.
> Realtime Search
> ---
>
> Key: LUCENE-1313
>
mply add a boolean param in the ctor to turn
on NRT instead of relying on getReader. Using getReader could
cause problems with switching directories midstream.
{quote}
Yes, let's switch to that.
> Realtime Search
> ---
>
> Key: LUCENE-1313
>
dd a boolean param in the ctor to turn
on NRT instead of relying on getReader. Using getReader could
cause problems with switching directories midstream.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira
UCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.
> Possible future directions:
> * Optimistic concurrency
> * Replication
> Encoding
gment, but also when it's time
> for DW to flush a new segment
In the new patch this is fixed.
{quote}
I don't see where this is taken into account? Did you mean to attach
a new patch?
> Realtime Search
> ---
>
> Key: LUCENE-1313
>
he algorithm is fairly simple? Find segments whose
total sizes are lower than whatever we have left of the max ram
buffer size? I have new code, but will rework it a bit to
include this discussion.
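
A minimal sketch of the selection idea in the snippet above, assuming the segments are considered in order and the selection stops at the first one that would not fit. The class and method names (RamSegmentSelectionSketch, selectWithinBudget) are hypothetical and do not come from the LUCENE-1313 patch; sizes are plain byte counts rather than Lucene SegmentInfo objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not code from the LUCENE-1313 patch.
class RamSegmentSelectionSketch {

    /**
     * Walks the RAM segment sizes in order and keeps adding segments while
     * their running total still fits in what is left of the max RAM buffer.
     * Returns the indexes of the selected segments.
     */
    static List<Integer> selectWithinBudget(long[] segmentSizes,
                                            long maxRamBytes,
                                            long usedRamBytes) {
        // Whatever is left of the budget; never negative.
        long remaining = Math.max(0L, maxRamBytes - usedRamBytes);
        long total = 0L;
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < segmentSizes.length; i++) {
            if (total + segmentSizes[i] > remaining) {
                // Assumption: stop here so the selected run stays contiguous.
                break;
            }
            total += segmentSizes[i];
            selected.add(i);
        }
        return selected;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] sizes = {4 * mb, 6 * mb, 8 * mb, 2 * mb};
        // 32 MB budget with 20 MB already in use leaves 12 MB.
        System.out.println(selectWithinBudget(sizes, 32 * mb, 20 * mb)); // prints [0, 1]
    }
}
```

Stopping at the first segment that does not fit is one possible choice; skipping it and continuing would pack the budget more tightly but could pick non-contiguous segments.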
> Realtime Search
> ---
>
> Key: LUCENE-1313
>
net RAM buffer). Instead we should force RAM -> disk at that
point, even though technically RAM is not yet full.
Ooh: maybe a better approach is to disallow the merge if the expected
peak RAM usage will exceed our buffer. I like this better. So if
budget is 32 MB, and net RAM
nough in updatePendingMerges or is
there more that needs to be done?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type
ng the new ramOverLimit?
* Still some noise (MockRAMDir, DocFieldProcessorPerThread, some
changes in LogMergePolicy)
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-131
in CHANGES.txt
I'm going to integrate LUCENE-1618 and test that out as a part of the next
patch.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene
o the IndexWriter constructor. This is because we
can't run the system if the ram dir is changed in the middle of
an operation.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
>
dexWriter (currently)
cannot handle {code}
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Com
ocked, but I don't think
it's necessary now if we have a separate ram merge policy.
{quote}
OK good.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
>
"internal" approach vs "external" one.{quote}
I think having the ram merge policy should cover the reasons I
had for having a separate ram writer. Although the IW.addWriter
method I implemented would not have blocked, but I don't think
" one.{quote}
I think having the ram merge policy should cover the reasons I
had for having a separate ram writer. Although the IW.addWriter
method I implemented would not have blocked, but I don't think
it's necessary now if we have a separate ram merge policy.
> Realtime Search
>
uest in a more optimal way (like an elevator,
sweeping floors). I haven't explicitly tested this with Lucene...
I believe SSDs handle concurrent requests very well since under the
hood most of them are multi-channel basically RAID0 devices (eg Intel
X25M has 10 channels).
> Realtime Searc
t shouldn't be too much
faster given the IO sequential access bottleneck?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue
file creation speeds will also be heavily dependent on the
exact file system used.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
FS file on disk). That might get most of the gains
since the FSDir sees only one file created per tiny segment, not N.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
>
sh slower?
{quote}
I'm still a little confused as to
why having a wrapper class that manages a disk writer and a ram
writer isn't cleaner?
{quote}
This is functionally the same as not mixing RAM vs disk merging,
right (ie just as "clean")?
> Realtime Search
>
o merge (when we've decided
we have too much in ram), sometimes we don't (when we just want
the ram segments to merge). I'm still a little confused as to
why having a wrapper class that manages a disk writer and a ram
writer isn't cleaner?
> Realtime Search
> ---
his issue focused on sometimes using RAMDir for newly created
segments.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New
To implement this functionality in parallel (and perhaps make
the overall patch cleaner), writing doc stores directly to a
separate directory can be a different patch? There can be an
option IW.setDocStoresDirectory(Directory) that the patch
implements? Then some unit tests that are separate from the near
realtime portion.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>
member. FSIndexOutput
uses a writeable RAF and FSIndexInput is read only, so why would
there be an issue?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lu
mponents: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucen
les punch straight through to the real directory, ie
bypass the RAMDir. The doc stores are space consuming, and since with
autoCommit=false we can bypass merging them, it makes no sense to store them in
the RAMDir.
We should probably do this optimization as a "phase 2", after
er writers' pool? Maybe the SegmentInfo can have a
reference to the writer it originated in? That way we can easily
access the right reader pool when we need it?
{quote}
I don't think we need two writers? I think one writer, sometimes
flushing to RAMDir, is a clean solution?
w the destination
writer obtains segmentreaders from source readers when they're
in the other writers' pool? Maybe the SegmentInfo can have a
reference to the writer it originated in? That way we can easily
access the right reader pool when we need it?
> Realtime Search
>
since there are fewer new files to fsync.
{quote}
Agreed, however the IW.getReader MultiSegmentReader removes
readers from another directory so we'd need to add a new
attribute to segmentinfo that marks it as ok for inclusion in
the MSR?
{quote}
Or, fix that f
y so we'd need to add a new
attribute to segmentinfo that marks it as ok for inclusion in
the MSR?
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project
uld work well I think, and should not require a separate
RAMIndex class, and won't block things when the RAM segments are
migrated to disk by CMS.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUC
the incoming IW, leave them in
RAM and they can be merged to disk as necessary? Then on
IW.flush any segmentinfo(s) that are not from the current
directory can be flushed to disk?
Just thinking out loud about this.
> Realtime Search
> ---
>
>
s the loss of
concurrency where a large rambuffer may be flushing to disk
while the user really wants small incremental NRT RI based
updates at the same time.
> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.or
the RAMIndex
reader if there is one.
The RAMIndex writer can be obtained and modified directly as
opposed to duplicating the setter methods of IndexWriter such as
setMergeScheduler.
> Realtime Search
> ---
>
> Key: LUCENE-1313
>
1 - 100 of 231 matches