Re: Blob storage

2008-12-26 Thread Babak Farhang
Most of all, I'm trying to communicate an *idea* which itself cannot be encumbered by any license, anyway. But if you want to incorporate some of this code into an asf project, I'd be happy to also release it under the apache license. Hope the license I chose for my project doesn't get in the way

Re: Blob storage

2008-12-26 Thread Ian Holsman
Babak Farhang wrote: Most of all, I'm trying to communicate an *idea* which itself cannot be encumbered by any license, anyway. But if you want to incorporate some of this code into an asf project, I'd be happy to also release it under the apache license. Hope the license I chose for my project

Re: Realtime Search

2008-12-26 Thread Michael McCandless
Marvin Humphrey mar...@rectangular.com wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? It sounds like this is your solution for decoupling segments changes due to merges from changes from docs

[jira] Commented: (LUCENE-1314) IndexReader.clone

2008-12-26 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659235#action_12659235 ] Michael McCandless commented on LUCENE-1314: OK I reviewed the patch; some

Re: Blob storage

2008-12-26 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Dec 26, 2008 at 2:11 PM, Babak Farhang farh...@gmail.com wrote: Most of all, I'm trying to communicate an *idea* which itself cannot be encumbered by any license, anyway. But if you want to incorporate some of this code into an asf project, I'd be happy to also release it under the

[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-26 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659240#action_12659240 ] Michael McCandless commented on LUCENE-1483: Given how different the results

Re: Blob storage

2008-12-26 Thread Grant Ingersoll
On Dec 26, 2008, at 9:07 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote: On Fri, Dec 26, 2008 at 2:11 PM, Babak Farhang farh...@gmail.com wrote: BTW . The license is a problem The license is a problem if Babak intends to donate it to the ASF. And it may be a problem for companies who don't

Re: Blob storage

2008-12-26 Thread Otis Gospodnetic
Similar thoughts here. I don't have ML thread pointers nor JIRA issue pointers, but there has been discussion in this area before, and I believe the thinking was that what's needed is a general interface/abstraction/API for storing and loading field data to an external component, be that a

Re: Realtime Search

2008-12-26 Thread Robert Engels
That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a

Re: Realtime Search

2008-12-26 Thread Robert Engels
Also, if you are really set on the mmap strategy, why not use the single file with fixed-length pages, using the header I proposed (and key compression)? You don't need any fancy partial-page stuff, just waste a small amount of space at the end of pages. I think this is going to be far faster

Re: Realtime Search

2008-12-26 Thread Doug Cutting
Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader

Re: Realtime Search

2008-12-26 Thread J. Delgado
The addition of docs into tiny segments using the current data structures seems the right way to go. Some time back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM-based) short-term index that auto-merged into a disk-based long-term index that

Re: Realtime Search

2008-12-26 Thread Marvin Humphrey
On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? The goal is to improve worst-case write performance. Currently, writes are

Re: Realtime Search

2008-12-26 Thread J. Delgado
One thing that I forgot to mention is that in our implementation the real-time indexing took place with many folder-based listeners writing to many tiny in-memory indexes partitioned by sub-sources with fewer long-term and archive indexes per box. Overall distributed search across various

Re: Realtime Search

2008-12-26 Thread Robert Engels
This is what we mostly do, but we serialize the documents to a log file first, so if the server crashes before the background merge of the RAM segments into the disk segments completes, we can replay the operations on server restart. Since the serialization is a sequential write to an already open
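
The log-then-replay approach described in this message can be illustrated with a small sketch. It is not the poster's implementation and it does not use Lucene; the class `LoggedRamBuffer`, its method names, and the JSON-lines log format are all hypothetical, chosen only to show the ordering of operations: append to the log, force it to disk, then update the in-memory structure, and replay the log on restart.

```python
import json
import os


class LoggedRamBuffer:
    """Toy stand-in for RAM segments backed by an append-only operations log.

    Hypothetical illustration only: a plain list plays the role of the RAM
    segments, and one JSON object per line plays the role of the log format.
    """

    def __init__(self, log_path):
        self.log_path = log_path
        self.ram_docs = []   # documents not yet merged into disk segments
        self._replay()       # recover operations logged before a crash
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    self.ram_docs.append(json.loads(line))
                except json.JSONDecodeError:
                    # A torn final record from a crash mid-write: stop here.
                    break

    def add(self, doc):
        # Sequential append to the already open log, forced to disk
        # before the document becomes visible in memory.
        self.log.write(json.dumps(doc) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.ram_docs.append(doc)

    def merge_done(self):
        # Called once the background merge has made the documents durable
        # elsewhere: the logged operations are no longer needed.
        self.log.close()
        self.log = open(self.log_path, "w", encoding="utf-8")  # truncate
        self.ram_docs.clear()

    def close(self):
        self.log.close()
```

Constructing a new `LoggedRamBuffer` on the same path after a simulated crash restores every document added since the last `merge_done()`. A real system would also need to log deletes and coordinate log truncation with the merge; both are left out here.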

Re: Realtime Search

2008-12-26 Thread Robert Engels
If you move to either the embedded or the server model, the post reopen is trivial, as the structures can be created as the segment is written. It is the networked shared access model that causes a lot of these optimizations to be far more complex than needed. Would it maybe be simpler to move

Re: Realtime Search

2008-12-26 Thread Robert Engels
There is also the distributed model - but in that case each node is running some sort of server anyway (as in Hadoop). It seems that the distributed model would be easier to develop using Hadoop over the embedded model. -Original Message- From: Robert Engels reng...@ix.netcom.com Sent:

Re: Realtime Search

2008-12-26 Thread Marvin Humphrey
Robert, Three exchanges ago in this thread, you made the incorrect assumption that the motivation behind using mmap was read speed, and that memory mapping was being waved around as some sort of magic wand: Is there something that I am missing? I see lots of references to using memory

Re: Realtime Search

2008-12-26 Thread Robert Engels
You are full of crap. From your own comments in LUCENE-1458: The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us to rely upon the operating system's virtual memory and avoid caching in process memory altogether? Say

Re: Realtime Search

2008-12-26 Thread Andrzej Bialecki
Robert Engels wrote: You are full of **beep** *beep* ... No matter whether you are right or wrong, please keep a civil tone on this public forum. We are professionals here, so let's discuss and disagree if we must - but in a professional and grown-up way. Thank you. -- Best regards,

stored fields / unicode compression

2008-12-26 Thread Robert Muir
Has there been any thought of using SCSU or BOCU-1 instead of UTF-8 for stored fields? Personally I don't put huge amounts of text in stored fields but these encodings/compression work extremely well on short strings like titles, etc. Removing the Unicode penalty for non-Latin text (i.e. cut in
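
The "penalty" mentioned here can be made concrete by counting encoded bytes. The sketch below only measures UTF-8 and UTF-16 sizes; Python's standard library has no SCSU or BOCU-1 codec, so the savings those encodings would give are not shown. The helper name `encoded_size` and the sample strings are illustrative assumptions, not anything from the thread.

```python
def encoded_size(text, encoding):
    """Return (code points, encoded bytes) for text in the given encoding."""
    return len(text), len(text.encode(encoding))


# Sample short "title-like" strings (illustrative, not from the thread).
SAMPLES = {
    "Latin": "Lucene in Action",
    # Devanagari "namaste", written with escapes so the code points are explicit.
    "Devanagari": "\u0928\u092e\u0938\u094d\u0924\u0947",
}


def report():
    """Collect (label, code points, UTF-8 bytes, UTF-16 bytes) per sample."""
    rows = []
    for label, text in SAMPLES.items():
        chars, utf8_bytes = encoded_size(text, "utf-8")
        _, utf16_bytes = encoded_size(text, "utf-16-le")  # -le avoids the BOM
        rows.append((label, chars, utf8_bytes, utf16_bytes))
    return rows


if __name__ == "__main__":
    for label, chars, utf8_bytes, utf16_bytes in report():
        print(f"{label}: {chars} code points, "
              f"{utf8_bytes} bytes UTF-8, {utf16_bytes} bytes UTF-16")
```

ASCII text costs one byte per character in UTF-8, while each Devanagari code point costs three. SCSU and BOCU-1 are designed to narrow that gap for scripts with small alphabets, which is the case the message is asking about.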

Re: Blob storage

2008-12-26 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Dec 26, 2008 at 10:05 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Similar thoughts here. I don't have ML thread pointers nor JIRA issue pointers, but there has been discussion in this area before, and I believe the thinking was that what's needed is a general

Re: ANNOUNCE: Welcome Ryan McKinley as Contrib/Documentation Committer

2008-12-26 Thread Ryan McKinley
Thanks! I look forward to getting back into this soon -- the holidays sure suck up more time than we imagine! Happy holidays to everyone. ryan On Dec 24, 2008, at 12:48 AM, Chris Hostetter wrote: I'm happy to announce that in recognition of his efforts in moving forward with creating