Re: Blob storage

Babak Farhang Sun, 28 Dec 2008 10:54:06 -0800

> PS I am relicensing skwish under Apache 2.0 and am in the process of
> re-releasing it from the project website under this new license. (I'm
> not that efficient with such chores, so bear with me :)


DONE

On Sat, Dec 27, 2008 at 1:15 AM, Babak Farhang <[email protected]> wrote:
>> - there will be a trade-off where reading the info from a 2nd system would
>> be slower than just a single call which has all the results. Especially if
>> you have to fetch a couple of these things.
>
> I agree.  So, for example, when an app is displaying the search
> results on a web page, it is certainly more efficient if the stored
> fields used in the presentation layer are stored directly in Lucene.
> But if the search results also contain links for fetching the original
> documents that were indexed by Lucene, and those documents are also
> managed by the app, then reading from the 2nd system will be
> advantageous. I suggest a rule-of-thumb may be that when it's the
> end-user that initiates the fetch of a stored field then the trade-off
> is worthwhile.
>
>> - how is this different than BDB, and a UUID. couldn't you just store it
>> using that?
>
> You could, and there are certainly a lot of advantages to using BDB.
> I think skwish might enjoy an advantage over BDB in the efficient I/O
> interface it exposes.  For example, using skwish you can retrieve a
> stored value as a java FileChannel which you can then write more
> efficiently to a socket.  (This idea is also fleshed out in the
> prototype non-blocking HTTP server included in the last release.)
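
A minimal sketch of the FileChannel-to-socket idea, assuming plain java.nio rather than the skwish API (which is not shown in this thread). BlobSender and send are hypothetical names, and the sink is assumed to be a blocking channel such as a blocking SocketChannel. The point is only that FileChannel.transferTo lets the OS move the bytes without first copying them into a Java buffer.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of skwish.
final class BlobSender {

    /**
     * Sends up to count bytes of file, starting at offset, to sink.
     * Returns the number of bytes actually sent.
     */
    static long send(Path file, long offset, long count, WritableByteChannel sink)
            throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = in.transferTo(offset + sent, count - sent, sink);
                if (n <= 0) {
                    break; // reached end of file (blocking sink assumed)
                }
                sent += n;
            }
            return sent;
        }
    }
}
```

With a non-blocking sink, a return value of 0 only means "try again later", so the loop would need to be driven by a selector instead.
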
>
>> - how are you going to deal with situations where the commit fails in
>> lucene. does the client have to recognize this and rollback skwish?
>
> Here's an example work flow from an experimental lucene / skwish mash
> up I'm trying:
>
> 1. New documents come in to be stored and indexed.
> 2. I store the contents of each document in a SegmentStore (the skwish
> term for a blob store) called *cache* (I guess after google cache?)
> 3. I store meta information about each document in another
> SegmentStore called *meta*.  This meta information includes the skwish
> id of the document contents in the *cache* store from step 2.
> 4. I run a Lucene IndexWriter over the new documents.  String fields
> are read from the *meta* store; stream fields are read from the
> *cache* store.  Most of the fields in the *meta* store are duplicated
> (stored) and/or indexed by Lucene.  Nothing from the *cache* store is
> stored directly by Lucene.
>
> At this point, the documents are stored and indexed in the mash up.
> If any of those steps were to fail, the step would have to be retried.
> Note there is enough information in the two skwish stores, *cache* and
> *meta*, to recreate the entire Lucene index.
>
> 5. The mash up returns search results.  Each search result item
> contains a link to the original document stored in the cache store
> (indirected through the cache id stored in a Lucene Field).
>
> The implementation aside, I don't imagine this work flow is too
> atypical.  You typically have to store the documents *somewhere*, and
> from the application's perspective, it makes sense to leave enough
> [meta] information around in order to be able to recreate a Lucene
> index from scratch.
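
The work flow above can be sketched roughly as follows. This is an illustration under loud assumptions: two in-memory lists stand in for the *cache* and *meta* SegmentStores, and a toy term map stands in for the Lucene index, so none of the names here (MashUp, Meta, add, rebuildIndex, search) come from skwish or Lucene. It only shows that the index holds ids rather than contents and can be recreated from the two stores.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the lucene / skwish mash up.
final class MashUp {
    // Stand-ins for the two stores; an id is simply a list position.
    final List<byte[]> cache = new ArrayList<>(); // document contents
    final List<Meta> meta = new ArrayList<>();    // meta information

    // Stand-in for the index: term -> ids of matching meta records.
    final Map<String, Set<Integer>> index = new HashMap<>();

    static final class Meta {
        final String title;
        final int cacheId; // id of the contents in the cache store (step 3)

        Meta(String title, int cacheId) {
            this.title = title;
            this.cacheId = cacheId;
        }
    }

    /** Steps 1-4: store the contents, store the meta record, then index. */
    int add(String title, byte[] contents) {
        cache.add(contents);                            // step 2
        meta.add(new Meta(title, cache.size() - 1));    // step 3
        int metaId = meta.size() - 1;
        indexOne(metaId);                               // step 4
        return metaId;
    }

    /** Recreates the index from scratch using only the two stores. */
    void rebuildIndex() {
        index.clear();
        for (int id = 0; id < meta.size(); id++) {
            indexOne(id);
        }
    }

    /** Step 5: returns the cache ids of the documents matching a term. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        Set<Integer> ids = index.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
        for (int metaId : ids) {
            hits.add(meta.get(metaId).cacheId);
        }
        return hits;
    }

    private void indexOne(int metaId) {
        Meta m = meta.get(metaId);
        String body = new String(cache.get(m.cacheId), StandardCharsets.UTF_8);
        String text = (m.title + " " + body).toLowerCase(Locale.ROOT);
        for (String term : text.split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(metaId);
            }
        }
    }
}
```

If indexing fails part-way, calling rebuildIndex() is the retry: nothing in the two stores needs to be rolled back.
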
>
>> on a positive note, it would shrink the index size and allow more records to
>> fit in memory.
>
> Especially important if a stored field (e.g. the contents of the
> original document) does not fit in memory.
>
> Regards
> -Babak
>
> PS I am relicensing skwish under Apache 2.0 and am in the process of
> re-releasing it from the project website under this new license. (I'm
> not that efficient with such chores, so bear with me :)
>
>
>
> On Fri, Dec 26, 2008 at 3:40 AM, Ian Holsman <[email protected]> wrote:
>> Babak Farhang wrote:
>>>
>>> Most of all, I'm trying to communicate an *idea* which itself cannot
>>> be encumbered by any license, anyway. But if you want to incorporate
>>> some of this code into an asf project, I'd be happy to also release it
>>> under the apache license. Hope the license I chose for my project
>>> doesn't get in the way of this conversation..
>>>
>>
>> as an idea, let me offer some thoughts.
>> - there will be a trade-off where reading the info from a 2nd system would
>> be slower than just a single call which has all the results. Especially if
>> you have to fetch a couple of these things.
>>
>> - how is this different than BDB, and a UUID. couldn't you just store it
>> using that?
>>
>> - how are you going to deal with situations where the commit fails in
>> lucene. does the client have to recognize this and rollback skwish?
>>
>> - there will need to be some kind of reconciliation process that will need
>> to deal with inconsistencies where someone forgets to delete the skwish
>> object when they have deleted the lucene record.
>>
>> on a positive note, it would shrink the index size and allow more records to
>> fit in memory.
>>
>> Regards
>> Ian
>>>
>>> On Fri, Dec 26, 2008 at 12:46 AM, Noble Paul നോബിള്‍ नोब्ळ्
>>> <[email protected]> wrote:
>>>
>>>>
>>>> The license is GPL. It cannot be used directly in any Apache projects.
>>>>
>>>> On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang <[email protected]>
>>>> wrote:
>>>>
>>>>>>
>>>>>> I assume one could use Skwish instead of Lucene's normal stored
>>>>>> fields to store & retrieve document data?
>>>>>>
>>>>>
>>>>> Exactly: instead of storing the field's value directly in Lucene, you
>>>>> could store it in skwish and then store its skwish id in the Lucene
>>>>> field instead.  This works well for serving large streams (e.g.
>>>>> original document contents).
>>>>>
>>>>>
>>>>>>
>>>>>> Have you run any threaded performance tests comparing the two?
>>>>>>
>>>>>
>>>>> No direct comparisons yet.
>>>>>
>>>>> -b
>>>>>
>>>>>
>>>>> On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless
>>>>> <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> This looks interesting!
>>>>>> I assume one could use Skwish instead of Lucene's normal stored
>>>>>> fields to store & retrieve document data?
>>>>>> Have you run any threaded performance tests comparing the two?
>>>>>> Mike
>>>>>>
>>>>>> Babak Farhang <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I've been working on a library called Skwish to complement indexes
>>>>>>> like Lucene,  for blob storage and retrieval. This is nothing more
>>>>>>> than a structured implementation of storing all the files in one file
>>>>>>> and managing their offsets in another.  The idea is to provide a fast,
>>>>>>> concurrent, lock-free way to serve lots of files to lots of users.
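
A rough sketch of the "all the files in one file, offsets in another" idea, assuming nothing about the actual skwish on-disk format or API. TinyBlobStore and its methods are hypothetical names. The offsets file here holds one 8-byte end offset per blob, so blob n spans from the end offset of blob n-1 (or 0) to its own end offset. Unlike the library described above, this sketch is single-threaded and makes no attempt at lock-free concurrency.

```java
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.WRITE;

import java.io.Closeable;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Hypothetical illustration, not the skwish format.
final class TinyBlobStore implements Closeable {
    private final FileChannel data;    // all blobs, stored back to back
    private final FileChannel offsets; // one 8-byte end offset per blob

    TinyBlobStore(Path dataFile, Path offsetFile) throws IOException {
        data = FileChannel.open(dataFile, CREATE, READ, WRITE);
        offsets = FileChannel.open(offsetFile, CREATE, READ, WRITE);
    }

    /** Appends a blob and returns its id (0, 1, 2, ...). Not thread-safe. */
    long append(byte[] blob) throws IOException {
        long end = writeFully(data, ByteBuffer.wrap(blob), data.size());
        long id = offsets.size() / Long.BYTES;
        ByteBuffer entry = ByteBuffer.allocate(Long.BYTES).putLong(0, end);
        writeFully(offsets, entry, id * Long.BYTES);
        return id;
    }

    /** Reads a whole blob into memory; blobs over 2 GB are not supported. */
    byte[] read(long id) throws IOException {
        long start = (id == 0) ? 0 : endOffset(id - 1);
        long end = endOffset(id);
        ByteBuffer dst = ByteBuffer.allocate(Math.toIntExact(end - start));
        readFully(data, dst, start);
        return dst.array();
    }

    private long endOffset(long id) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        readFully(offsets, buf, id * Long.BYTES);
        return buf.getLong(0);
    }

    private static long writeFully(FileChannel ch, ByteBuffer src, long pos)
            throws IOException {
        while (src.hasRemaining()) {
            pos += ch.write(src, pos);
        }
        return pos;
    }

    private static void readFully(FileChannel ch, ByteBuffer dst, long pos)
            throws IOException {
        while (dst.hasRemaining()) {
            int n = ch.read(dst, pos);
            if (n < 0) {
                throw new EOFException("read past end of file");
            }
            pos += n;
        }
    }

    @Override
    public void close() throws IOException {
        data.close();
        offsets.close();
    }
}
```

Because a blob is just a (start, end) range in the data file, it can also be handed out as a region of a FileChannel instead of a byte array.
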
>>>>>>>
>>>>>>> Hope you find it useful or interesting.
>>>>>>>
>>>>>>> -Babak
>>>>>>> http://skwish.sourceforge.net/
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> --Noble Paul
>>>>
>>
>>
>
