Re: Blob storage

Babak Farhang Sat, 27 Dec 2008 00:16:10 -0800

> - there will be a trade-off where reading the info from a 2nd system would
> be slower than just a single call which has all the results. Especially if
> you have to fetch a couple of these things.

I agree.  So, for example, when an app is displaying the search
results on a web page, it is certainly more efficient if the stored
fields used in the presentation layer are stored directly in Lucene.
But if the search results also contain links for fetching the original
documents that were indexed by Lucene, and those documents are also
managed by the app, then reading them from the 2nd system will be
advantageous.  As a rule of thumb, I'd suggest the trade-off is
worthwhile whenever it's the end user who initiates the fetch of a
stored field.

> - how is this different than BDB, and a UUID. couldn't you just store it
> using that?

You could, and there are certainly a lot of advantages to using BDB.
I think skwish might enjoy an advantage over BDB in the efficient I/O
interface it exposes.  For example, using skwish you can retrieve a
stored value as a Java FileChannel, which you can then write more
efficiently to a socket.  (This idea is also fleshed out in the
prototype non-blocking HTTP server included in the last release.)
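
To make the FileChannel point concrete, here is a minimal sketch using
only java.nio.  It is not skwish code: BlobSender and send() are
hypothetical names, and the offset/count pair simply assumes that a
stored value occupies a known byte range of a larger file.  The call
that matters is FileChannel.transferTo, which lets the JVM hand the copy
to the operating system where it can, instead of pulling the bytes
through a user-space buffer.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not part of skwish. */
final class BlobSender {
    private BlobSender() {}

    /**
     * Copies up to `count` bytes, starting at `offset` in `file`, to `out`
     * (for example a SocketChannel). Returns the number of bytes sent.
     */
    static long send(Path file, long offset, long count, WritableByteChannel out)
            throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = in.transferTo(offset + sent, count - sent, out);
                if (n <= 0) {
                    // End of file reached, or the target accepted nothing
                    // (possible with a non-blocking channel); stop here
                    // rather than spin.
                    break;
                }
                sent += n;
            }
            return sent;
        }
    }
}
```

With a blocking SocketChannel as the target, a call such as
BlobSender.send(path, offset, length, socket) would stream one stored
value to the client.  A non-blocking server would instead keep the
returned count and resume when the socket becomes writable again.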

> - how are you going to deal with situations where the commit fails in
> lucene. does the client have to recognize this and rollback skwish?

Here's an example workflow from an experimental Lucene/skwish mashup
I'm trying:

1. New documents come in to be stored and indexed.
2. I store the contents of each document in a SegmentStore (the skwish
term for a blob store) called *cache* (named, I guess, after Google's
cache).
3. I store meta information about each document in another
SegmentStore called *meta*.  This meta information includes the skwish
id of the document contents in the *cache* store from step 2.
4. I run a Lucene IndexWriter over the new documents.  String fields
are read from the *meta* store; stream fields are read from the
*cache* store.  Most of the fields in the *meta* store are duplicated
(stored) and/or indexed by Lucene.  Nothing from the *cache* store is
stored directly by Lucene.

At this point, the documents are stored and indexed in the mashup.
If any of those steps were to fail, that step would have to be retried.
Note that there is enough information in the two skwish stores, *cache*
and *meta*, to recreate the entire Lucene index.

5. The mashup returns search results.  Each search result item
contains a link to the original document stored in the *cache* store
(indirected through the cache id stored in a Lucene Field).
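
Steps 2 through 5 can be sketched roughly as follows.  Everything in the
sketch is a stand-in: BlobStore is a toy in-memory list rather than a
skwish SegmentStore, ToyIndex is a tiny term map rather than a Lucene
index, and the "/cache/<id>" link format is made up for illustration.
It only shows the shape of the data flow, including how the index can
be rebuilt from the two stores alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; none of these types are skwish or Lucene APIs. */
final class MashupSketch {

    /** Toy blob store: append-only, ids assigned in insertion order. */
    static final class BlobStore {
        private final List<byte[]> blobs = new ArrayList<>();
        long put(byte[] data) { blobs.add(data.clone()); return blobs.size() - 1; }
        byte[] get(long id) { return blobs.get((int) id).clone(); }
        long size() { return blobs.size(); }
    }

    /** Toy index: term -> meta ids, plus one "stored field" (the cache id). */
    static final class ToyIndex {
        private final Map<String, Set<Long>> postings = new HashMap<>();
        private final Map<Long, Long> storedCacheId = new HashMap<>();

        void add(long metaId, long cacheId, String text) {
            storedCacheId.put(metaId, cacheId);
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(metaId);
                }
            }
        }

        Set<Long> search(String term) {
            Set<Long> hits = postings.get(term.toLowerCase(Locale.ROOT));
            return hits == null ? Collections.<Long>emptySet() : hits;
        }

        long cacheIdOf(long metaId) { return storedCacheId.get(metaId); }
    }

    final BlobStore cache = new BlobStore();
    final BlobStore meta = new BlobStore();
    ToyIndex index = new ToyIndex();

    /** Steps 2-4: store the contents, store the meta record, then index. */
    long ingest(String title, String contents) {
        long cacheId = cache.put(contents.getBytes(StandardCharsets.UTF_8));
        // The meta record carries the cache id on its first line.
        long metaId = meta.put((cacheId + "\n" + title).getBytes(StandardCharsets.UTF_8));
        indexEntry(index, metaId);
        return metaId;
    }

    /** Indexes one document using only what the two stores hold. */
    private void indexEntry(ToyIndex target, long metaId) {
        String[] m = new String(meta.get(metaId), StandardCharsets.UTF_8).split("\n", 2);
        long cacheId = Long.parseLong(m[0]);
        String contents = new String(cache.get(cacheId), StandardCharsets.UTF_8);
        target.add(metaId, cacheId, m[1] + " " + contents);
    }

    /** Recreates the index from scratch from the meta and cache stores. */
    void rebuildIndex() {
        ToyIndex fresh = new ToyIndex();
        for (long id = 0; id < meta.size(); id++) {
            indexEntry(fresh, id);
        }
        index = fresh;
    }

    /** Step 5: each hit becomes a link indirected through the cache id. */
    List<String> searchLinks(String term) {
        List<String> links = new ArrayList<>();
        for (long metaId : index.search(term)) {
            links.add("/cache/" + index.cacheIdOf(metaId));
        }
        return links;
    }
}
```

Because indexing reads only from the two stores, retrying a failed
indexing step and rebuilding the whole index are the same operation at
different scales.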

The implementation aside, I don't imagine this workflow is too
atypical.  You typically have to store the documents *somewhere*, and
from the application's perspective, it makes sense to leave enough
[meta] information around in order to be able to recreate a Lucene
index from scratch.

> on a positive note, it would shrink the index size and allow more records to
> fit in memory.

Especially important if a stored field (e.g. the contents of the
original document) does not fit in memory.

Regards
-Babak

PS I am relicensing skwish under Apache 2.0 and am in the process of
re-releasing it from the project website under this new license. (I'm
not that efficient with such chores, so bear with me :)

On Fri, Dec 26, 2008 at 3:40 AM, Ian Holsman <li...@holsman.net> wrote:
> Babak Farhang wrote:
>>
>> Most of all, I'm trying to communicate an *idea* which itself cannot
>> be encumbered by any license, anyway. But if you want to incorporate
>> some of this code into an asf project, I'd be happy to also release it
>> under the apache license. Hope the license I chose for my project
>> doesn't get in the way of this conversation..
>>
>
> as an idea, let me offer some thoughts.
> - there will be a trade-off where reading the info from a 2nd system would
> be slower than just a single call which has all the results. Especially if
> you have to fetch a couple of these things.
>
> - how is this different than BDB, and a UUID. couldn't you just store it
> using that?
>
> - how are you going to deal with situations where the commit fails in
> lucene. does the client have to recognize this and rollback skwish?
>
> - there will need to be some kind of reconciliation process that will need
> to deal with inconsistencies where someone forgets to delete the skwish
> object when they have deleted the lucene record.
>
> on a positive note, it would shrink the index size and allow more records to
> fit in memory.
>
> Regards
> Ian
>>
>> On Fri, Dec 26, 2008 at 12:46 AM, Noble Paul നോബിള്‍ नोब्ळ्
>> <noble.p...@gmail.com> wrote:
>>
>>>
>>> The license is GPL.  It cannot be used directly in any Apache projects.
>>>
>>> On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang <farh...@gmail.com>
>>> wrote:
>>>
>>>>>
>>>>> I assume one could use Skwish instead of Lucene's normal stored fields
>>>>> to
>>>>> store & retrieve document data?
>>>>>
>>>>
>>>> Exactly: instead of storing the field's value directly in Lucene, you
>>>> could store it in skwish and then store its skwish id in the Lucene
>>>> field instead.  This works well for serving large streams (e.g.
>>>> original document contents).
>>>>
>>>>
>>>>>
>>>>> Have you run any threaded performance tests comparing the two?
>>>>>
>>>>
>>>> No direct comps, yet.
>>>>
>>>> -b
>>>>
>>>>
>>>> On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless
>>>> <luc...@mikemccandless.com> wrote:
>>>>
>>>>>
>>>>> This looks interesting!
>>>>> I assume one could use Skwish instead of Lucene's normal stored fields
>>>>> to
>>>>> store & retrieve document data?
>>>>> Have you run any threaded performance tests comparing the two?
>>>>> Mike
>>>>>
>>>>> Babak Farhang <farh...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I've been working on a library called Skwish to complement indexes
>>>>>> like Lucene,  for blob storage and retrieval. This is nothing more
>>>>>> than a structured implementation of storing all the files in one file
>>>>>> and managing their offsets in another.  The idea is to provide a fast,
>>>>>> concurrent, lock-free way to serve lots of files to lots of users.
>>>>>>
>>>>>> Hope you find it useful or interesting.
>>>>>>
>>>>>> -Babak
>>>>>> http://skwish.sourceforge.net/
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> --Noble Paul
>>>
>>>
>>>
>>>
>
>
>
>
