Most of all, I'm trying to communicate an *idea*, which itself cannot
be encumbered by any license anyway. But if you want to incorporate
some of this code into an ASF project, I'd be happy to also release it
under the Apache license. Hope the license I chose for my project
doesn't get in the way.
Babak Farhang wrote:
Marvin Humphrey mar...@rectangular.com wrote:
4) Allow 2 concurrent writers: one for small, fast updates, and one for
big background merges.
Marvin, can you describe this in more detail? It sounds like this is your
solution for decoupling segment changes due to merges from changes
due to docs
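A minimal sketch of how that two-writer split could work (all names here are illustrative assumptions, not Lucene API): the fast writer holds the lock only for a bookkeeping update, while the background merger does its heavy lifting off-lock and swaps the merged result in atomically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative sketch of the two-concurrent-writers idea: a fast path for
 *  small updates and a slow path for big merges, sharing one short lock. */
public class TwoWriterIndex {
    private final ReentrantLock commitLock = new ReentrantLock();
    private final List<List<String>> segments = new ArrayList<>();

    /** Fast path: append a tiny segment; lock held only for the list update. */
    public void addSegment(List<String> docs) {
        commitLock.lock();
        try { segments.add(new ArrayList<>(docs)); }
        finally { commitLock.unlock(); }
    }

    /** Slow path: merge a snapshot outside the lock, then swap it in. */
    public void mergeAll() {
        List<List<String>> snapshot;
        commitLock.lock();
        try { snapshot = new ArrayList<>(segments); }
        finally { commitLock.unlock(); }

        List<String> merged = new ArrayList<>();   // heavy work, no lock held
        for (List<String> seg : snapshot) merged.addAll(seg);

        commitLock.lock();
        try {
            // Segments added while merging (they only append) survive the swap.
            List<List<String>> tail =
                new ArrayList<>(segments.subList(snapshot.size(), segments.size()));
            segments.clear();
            segments.add(merged);
            segments.addAll(tail);
        } finally { commitLock.unlock(); }
    }

    public int segmentCount() {
        commitLock.lock();
        try { return segments.size(); }
        finally { commitLock.unlock(); }
    }
}
```

The point of the sketch is that the small-update writer never waits on a merge; worst-case write latency stays bounded by the cheap bookkeeping under the lock.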
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659235#action_12659235 ]
Michael McCandless commented on LUCENE-1314:
OK I reviewed the patch; some
On Fri, Dec 26, 2008 at 2:11 PM, Babak Farhang farh...@gmail.com wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659240#action_12659240 ]
Michael McCandless commented on LUCENE-1483:
Given how different the results
On Dec 26, 2008, at 9:07 AM, Noble Paul നോബിള് नोब्ळ् wrote:
On Fri, Dec 26, 2008 at 2:11 PM, Babak Farhang farh...@gmail.com
wrote:
BTW, the license is a problem.
The license is a problem if Babak intends to donate it to the ASF.
And it may be a problem for companies who don't
Similar thoughts here. I don't have ML thread pointers nor JIRA issue
pointers, but there has been discussion in this area before, and I believe the
thinking was that what's needed is a general interface/abstraction/API for
storing and loading field data to an external component, be that a
That could very well be, but I was referencing your statement:
1) Design index formats that can be memory mapped rather than slurped,
bringing the cost of opening/reopening an IndexReader down to a
negligible level.
The only reason to do this (or have it happen) is if you perform a
Also, if you are really set on the mmap strategy, why not use the single file
with fixed length pages, using the header I proposed (and key compression). You
don't need any fancy partial page stuff, just waste a small amount of space at
the end of pages.
I think this is going to be far faster
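As a rough illustration of the single-file, fixed-length-page layout described above, here is a minimal Java sketch using java.nio memory mapping. The 4096-byte page size and the 4-byte length prefix are assumptions for illustration only; the header Robert refers to is not spelled out in this thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: fixed-length pages in one file, memory-mapped for reads.
 *  Layout (page size, length-prefix header) is assumed, not Lucene's. */
public class PagedFile {
    static final int PAGE_SIZE = 4096; // assumed page size

    /** Write each record into its own page, padding the unused tail. */
    static void writePages(Path path, byte[][] records) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
            for (byte[] rec : records) {
                page.clear();
                page.putInt(rec.length);                 // tiny header: record length
                page.put(rec);                           // payload
                while (page.hasRemaining()) page.put((byte) 0); // wasted tail space
                page.flip();
                ch.write(page);
            }
        }
    }

    /** Read record i straight out of the mapping; no slurping, no partial pages. */
    static byte[] readPage(Path path, int i) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY,
                                          (long) i * PAGE_SIZE, PAGE_SIZE);
            byte[] rec = new byte[map.getInt()];
            map.get(rec);
            return rec;
        }
    }
}
```

Because every page starts at `i * PAGE_SIZE`, record lookup is one multiply and one map access; the cost is the padding wasted at the end of each page.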
Michael McCandless wrote:
So then I think we should start with approach #2 (build real-time on
top of the Lucene core) and iterate from there. Newly added docs go
into tiny segments, which IndexReader.reopen pulls in. Replaced or
deleted docs record the delete against the right SegmentReader
The addition of docs into tiny segments using the current data structures
seems the right way to go. Some time back, one of my engineers implemented
pseudo real-time search using MultiSearcher, with an in-memory (RAM-based)
short-term index that auto-merged into a disk-based long-term index that
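The RAM-plus-disk MultiSearcher arrangement described above can be sketched like this. Plain maps stand in for the actual Lucene indexes, and all names are illustrative: new docs land in a small in-memory tier, searches fan out over both tiers, and a periodic flush drains the short-term tier into the long-term one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative two-tier pseudo real-time index: a tiny RAM tier for fresh
 *  docs plus a long-term tier, searched together MultiSearcher-style. */
public class TwoTierIndex {
    private final Map<Integer, String> ramTier = new LinkedHashMap<>();
    private final Map<Integer, String> diskTier = new LinkedHashMap<>(); // stand-in for on-disk index

    /** New docs are visible immediately via the RAM tier. */
    public synchronized void add(int id, String text) { ramTier.put(id, text); }

    /** Search both tiers; RAM hits come first since they are newest. */
    public synchronized List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        ramTier.forEach((id, t) -> { if (t.contains(term)) hits.add(id); });
        diskTier.forEach((id, t) -> { if (t.contains(term)) hits.add(id); });
        return hits;
    }

    /** Periodic auto-merge: drain the short-term tier into the long-term one. */
    public synchronized void flush() {
        diskTier.putAll(ramTier);
        ramTier.clear();
    }
}
```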
On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote:
4) Allow 2 concurrent writers: one for small, fast updates, and one for
big background merges.
Marvin can you describe more detail here?
The goal is to improve worst-case write performance.
Currently, writes are
One thing that I forgot to mention is that in our implementation the
real-time indexing took place with many folder-based listeners writing to
many tiny in-memory indexes, partitioned by sub-source, with fewer
long-term and archive indexes per box. Overall distributed search across
various
This is what we mostly do, but we serialize the documents to a log file first,
so if the server crashes before the background merge of the RAM segments into
the disk segments completes, we can replay the operations on server restart.
Since the serialization is a sequential write to an already-open
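A minimal sketch of that log-first scheme. The one-operation-per-line text format is an assumption for illustration, and a real implementation would keep the log file open across appends rather than reopening it per call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

/** Sketch: append every operation to a log before it touches the RAM
 *  segments, so a crash before the background merge completes can be
 *  repaired by replaying the log on restart. */
public class OpLog {
    private final Path logFile;

    public OpLog(Path logFile) { this.logFile = logFile; }

    /** Sequential append (reopened per call here only for brevity). */
    public void append(String op) throws IOException {
        Files.write(logFile, Collections.singletonList(op),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** On restart, return every logged op, in order, for replay. */
    public List<String> replay() throws IOException {
        return Files.exists(logFile) ? Files.readAllLines(logFile)
                                     : Collections.emptyList();
    }
}
```

Because the append is a sequential write to one file, the durability cost per document stays small compared to the indexing work itself.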
If you move to either the embedded or the server model, the post-reopen step is
trivial, as the structures can be created as the segment is written.
It is the networked shared access model that causes a lot of these
optimizations to be far more complex than needed.
Would it maybe be simpler to move
There is also the distributed model - but in that case each node is running
some sort of server anyway (as in Hadoop).
It seems that the distributed model would be easier to develop using Hadoop
over the embedded model.
Robert,
Three exchanges ago in this thread, you made the incorrect assumption that the
motivation behind using mmap was read speed, and that memory mapping was being
waved around as some sort of magic wand:
Is there something that I am missing? I see lots of references to
using memory
You are full of crap. From your own comments in LUCENE-1458:
The work on streamlining the term dictionary is excellent, but perhaps we can
do better still. Can we design a format that allows us to rely upon the operating
system's virtual memory and avoid caching in process memory altogether?
Say
Robert Engels wrote:
You are full of **beep** *beep* ...
No matter whether you are right or wrong, please keep a civil tone on
this public forum. We are professionals here, so let's discuss and
disagree if need be - but in a professional and grown-up way. Thank you.
--
Best regards,
Has there been any thought of using SCSU or BOCU-1 instead of UTF-8 for
stored fields?
Personally I don't put huge amounts of text in stored fields, but these
encodings/compression schemes work extremely well on short strings like titles,
etc. Removing the Unicode penalty for non-Latin text (i.e. cut in
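For a rough sense of the penalty being discussed, here is a quick way to compare UTF-8 byte sizes of short strings. (SCSU and BOCU-1 themselves are not in the JDK; ICU provides implementations.)

```java
import java.nio.charset.StandardCharsets;

/** Measures the UTF-8 size of short strings, to show the roughly 2x cost
 *  that non-Latin text pays versus Latin text of the same length. */
public class Utf8Penalty {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String latin = "title";    // 5 chars, 1 byte each in UTF-8
        String greek = "τίτλος";   // 6 chars, 2 bytes each in UTF-8
        System.out.println(utf8Bytes(latin) + " vs " + utf8Bytes(greek));
    }
}
```

SCSU and BOCU-1 bring most non-Latin scripts back toward one byte per character on short strings, which is exactly the case stored-field titles hit.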
On Fri, Dec 26, 2008 at 10:05 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
Thanks!
I look forward to getting back into this soon -- the holidays sure
suck up more time than we imagine!
Happy holidays to everyone.
ryan
On Dec 24, 2008, at 12:48 AM, Chris Hostetter wrote:
I'm happy to announce that in recognition of his efforts in moving
forward with creating