Re: Blob storage

Babak Farhang Mon, 29 Dec 2008 22:01:46 -0800

> the thinking was that what's needed is a general interface/abstraction/API 
> for storing and loading field data to an external component

Implementation pluggability is a sensible route to take.

Let me speculate on how such an API might work; if I wander astray, do
interrupt me.

Consider loading the field data, first.  The implementation layer (the
thing abstracted away by the API) will need, at minimum, a key from
the caller to retrieve the field data.  Under the covers, Lucene will
maintain the "external" field value by storing this key.

The key needed at the implementation layer may be an int, a long, a
string, whatever.  At the API level, the most general representation
of the key will likely be just a byte array (or perhaps a
ByteBuffer?).

So the service provider interface (ignoring transactions, for the
moment) might have a method like

interface FieldStoreProvider {

    FieldValue getFieldValue(byte[] key);

    . .
}

where FieldValue is just a stand-in for one or more yet-to-be-defined types.
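To make the key discussion concrete, here is a minimal sketch of how an int/long or string key could be flattened to a byte array, together with an in-memory stand-in for the external store. The names Keys and InMemoryFieldStore are hypothetical and not part of Lucene, and plain byte[] stands in for the undefined FieldValue type.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (an assumption, not a Lucene class): turns
// implementation-level keys into the general byte[] representation.
final class Keys {
    private Keys() {}

    // Encodes a long as 8 big-endian bytes.
    static byte[] fromLong(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    // Decodes a key produced by fromLong.
    static long toLong(byte[] key) {
        if (key.length != Long.BYTES) {
            throw new IllegalArgumentException("expected an 8-byte key");
        }
        return ByteBuffer.wrap(key).getLong();
    }

    // Encodes a string key as UTF-8 bytes.
    static byte[] fromString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical in-memory stand-in for an external store.
final class InMemoryFieldStore {
    // byte[] compares by identity, so keys are wrapped in ByteBuffer,
    // whose equals/hashCode compare by content.
    private final Map<ByteBuffer, byte[]> values = new HashMap<>();

    void put(byte[] key, byte[] value) {
        values.put(ByteBuffer.wrap(key.clone()), value.clone());
    }

    // Returns a copy of the stored value, or null if the key is unknown.
    byte[] getFieldValue(byte[] key) {
        byte[] value = values.get(ByteBuffer.wrap(key));
        return value == null ? null : value.clone();
    }
}
```

One practical wrinkle this shows: a raw byte[] cannot be used directly as a hash key in Java, so an implementation has to wrap or copy it.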

This type of method would be easy to implement with BDB, Skwish, and a
little clumsily with an RDBMS (since you must either commit to a
particular DB schema, or configure a tool that maps values from an
existing DB schema to their API *key* representation above, e.g. at its
most primitive level, the key specifies an SQL query).

Now let's consider storing the field value.  This brings up 2
other issues: 1) where does the key come from (who specifies the key),
and 2) can the field value be updated?  Below are some example
method signatures illustrating these design choices.

interface FieldStoreProvider {

    . .

    // inserts the given field value and returns a key
    // created by the provider implementation
    //
    byte[] putFieldValue(FieldValue value);

    // inserts the given field value with specified key;
    // it's the responsibility of the caller to ensure that
    // the key is unique
    //
    void insertFieldValue(FieldValue value, byte[] key);

    // updates the field value for the given key
    //
    void updateFieldValue(FieldValue value, byte[] key);

    . .
}

The first method returns an implementation-specific key on data
insertion.  This method can be efficiently implemented using BDB,
RDBMSs, or something like skwish.  If we leave the key generation to
the implementation, the returned key would likely be just an encoding
of an integral value.
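As an illustration of provider-generated keys, here is a minimal sketch of an append-only store whose key is simply the value's position in a list, encoded as 8 bytes. AppendOnlyFieldStore is a hypothetical name; this is not the API of Skwish, BDB, or Lucene, and byte[] again stands in for FieldValue.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only store: the key is the value's index in a list.
final class AppendOnlyFieldStore {
    private final List<byte[]> values = new ArrayList<>();

    // Appends the value and returns its index as 8 big-endian bytes.
    synchronized byte[] putFieldValue(byte[] value) {
        long index = values.size();
        values.add(value.clone());
        return ByteBuffer.allocate(Long.BYTES).putLong(index).array();
    }

    // Decodes the index from the key and returns a copy of the value.
    synchronized byte[] getFieldValue(byte[] key) {
        if (key.length != Long.BYTES) {
            throw new IllegalArgumentException("not a key issued by this store");
        }
        long index = ByteBuffer.wrap(key).getLong();
        if (index < 0 || index >= values.size()) {
            throw new IllegalArgumentException("unknown key");
        }
        return values.get((int) index).clone();
    }
}
```

Because the store picks the key, a lookup is a plain index access and no search structure is needed.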

The next (2nd) method above, however, allows the user to specify an
arbitrary (but unique) key for the data being inserted.  This can be
implemented using either BDB or an RDBMS, but not something
dead-simple like skwish.  The RDBMS implementation would necessarily
involve some kind of fixed database schema.

Here's my thinking on the pros and cons of allowing the user (rather
than the implementation) to specify the key for an externally stored
field value.

Pros:
* It allows one to specify a globally unique key, e.g. a UUID
* It might make accessing the externally stored field value from
outside Lucene easier.  E.g. the field value can be accessed from the
outside using a UUID or a path-like structure.
* It allows an application to have a priori knowledge of what the key
will be *before* the field data is committed (inserted) to external
storage. This might simplify constructing application-specific object
graphs and the like.

Cons:
*  You can't use [something like] skwish, and performance will likely
suffer.  Skwish implements a random access list (the key just
represents the index [offset] into the list), whereas arbitrary keys
require a map, which is what BDB implements.
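For comparison with the list case, here is a minimal sketch of the map-backed alternative with caller-specified keys, using a UUID as the key. MapFieldStore and keyOf are hypothetical names, and the duplicate-key check is an added assumption (the signatures above leave uniqueness to the caller).

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical map-backed store accepting arbitrary caller-chosen keys.
final class MapFieldStore {
    // Keys are wrapped in ByteBuffer so they compare by content.
    private final Map<ByteBuffer, byte[]> values = new HashMap<>();

    // Encodes a UUID as 16 bytes, most significant half first.
    static byte[] keyOf(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Inserts under the caller's key; fails if the key is already in use.
    void insertFieldValue(byte[] value, byte[] key) {
        byte[] previous = values.putIfAbsent(ByteBuffer.wrap(key.clone()), value.clone());
        if (previous != null) {
            throw new IllegalStateException("key already in use");
        }
    }

    // Replaces the value for an existing key; fails if the key is unknown.
    void updateFieldValue(byte[] value, byte[] key) {
        byte[] previous = values.replace(ByteBuffer.wrap(key.clone()), value.clone());
        if (previous == null) {
            throw new IllegalStateException("unknown key");
        }
    }

    // Returns a copy of the stored value, or null if the key is unknown.
    byte[] getFieldValue(byte[] key) {
        byte[] value = values.get(ByteBuffer.wrap(key));
        return value == null ? null : value.clone();
    }
}
```

Here the application can compute keyOf(id) before anything is written, which is the a priori knowledge mentioned in the pros; the price is a hash (or tree) lookup instead of an index access.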

The last (3rd) method introduced above, updating a field value using
the field's key, is obviously useful.  However, if the externally
stored field is indexed by Lucene, (say by being the source of a
field's Field.readerValue() ), then the Lucene document must itself be
replaced.  In that case, this method in the provider interface is of
little value to Lucene.  This method can be implemented using BDB or an
RDBMS, but not with skwish.

So those last 3 write methods represent some fundamental design
choices. And I would argue that the best choice is to make no choice
at all, and let the user do the writing (with the option of updates)
directly through the backing external storage implementation instead.
The service provider interface Lucene is concerned with would be
read-only, and we'd be back to

interface FieldStoreProvider {

    FieldValue getFieldValue(byte[] key);

}

where FieldValue was a yet-to-be-defined type.

Now about FieldValue. I think it would be nice if we could
somehow expand the interface definition of a Field to include a method
that returns a stream-representation of a stored field value. That
way, the memory overhead for accessing large stored fields could be
minimized. In that event, the provider interface might include methods
like

interface FieldStoreProvider {

    byte[] getBytes(byte[] key);

    FileChannel getChannel(byte[] key);

    long getSize(byte[] key);

    int copyInto(byte[] key, byte[] sink);

    int copyInto(byte[] key, ByteBuffer sink);

}
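A minimal sketch of how such stream-oriented methods might be backed by plain files, assuming one file per value under a root directory with the key being the UTF-8 encoded file name. DirectoryFieldStore and that key convention are hypothetical, and the sketch adds checked IOExceptions that the signatures above do not declare.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical read-only provider: one file per stored value.
final class DirectoryFieldStore {
    private final Path root;

    DirectoryFieldStore(Path root) {
        this.root = root.toAbsolutePath().normalize();
    }

    // Maps a key to a file under the root, rejecting keys that escape it.
    private Path resolve(byte[] key) {
        String name = new String(key, StandardCharsets.UTF_8);
        Path path = root.resolve(name).normalize();
        if (!path.startsWith(root)) {
            throw new IllegalArgumentException("key escapes the store root");
        }
        return path;
    }

    long getSize(byte[] key) throws IOException {
        return Files.size(resolve(key));
    }

    // The caller is responsible for closing the returned channel.
    FileChannel getChannel(byte[] key) throws IOException {
        return FileChannel.open(resolve(key), StandardOpenOption.READ);
    }

    // Copies as many bytes as fit into sink; returns the number copied.
    int copyInto(byte[] key, ByteBuffer sink) throws IOException {
        int total = 0;
        try (FileChannel in = getChannel(key)) {
            while (sink.hasRemaining()) {
                int n = in.read(sink);
                if (n < 0) {
                    break; // end of the stored value
                }
                total += n;
            }
        }
        return total;
    }

    int copyInto(byte[] key, byte[] sink) throws IOException {
        return copyInto(key, ByteBuffer.wrap(sink));
    }
}
```

With getSize and copyInto, a caller can page through a large value with a fixed-size buffer instead of materializing it as a single byte[].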

There are a lot more issues to explore, such as how the external
storage provider is configured.  (E.g. is it VM-specific or
DbDirectory-specific?) But I think I should pause to see whether I'm
on the right page, or have fallen down an elevator shaft.

Regards,
-Babak

On Fri, Dec 26, 2008 at 9:35 AM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Similar thoughts here.  I don't have ML thread pointers nor JIRA issue 
> pointers, but there has been discussion in this area before, and I believe 
> the thinking was that what's needed is a general interface/abstraction/API 
> for storing and loading field data to an external component, be that a BDB, 
> an RDBMS, or something like Skwish.  I *think* that often came up in the 
> context of Document updates (as opposed to delete+add).
>
>
> I didn't look at Skwish, but I think this is the direction to explore, Babak, 
> esp. if we can come up with something that lets one plug in other types of 
> storage, as well as deal with transaction type stuff that Ian mentioned.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Ian Holsman <li...@holsman.net>
>> To: java-dev@lucene.apache.org
>> Sent: Friday, December 26, 2008 5:40:36 AM
>> Subject: Re: Blob storage
>>
>> Babak Farhang wrote:
>> > Most of all, I'm trying to communicate an *idea* which itself cannot
>> > be encumbered by any license, anyway. But if you want to incorporate
>> > some of this code into an asf project, I'd be happy to also release it
>> > under the apache license. Hope the license I chose for my project
>> > doesn't get in the way of this conversation..
>> >
>>
>> as an idea, let me offer some thoughts.
>> - there will be a trade-off where reading the info from a 2nd system
>> would be slower than just a single call which has all the results.
>> Especially if you have to fetch a couple of these things.
>>
>> - how is this different than BDB, and a UUID. couldn't you just store it
>> using that?
>>
>> - how are you going to deal with situations where the commit fails in
>> lucene. does the client have to recognize this and rollback skwish?
>>
>> - there will need to be some kind of reconciliation process that will
>> need to deal with inconsistencies where someone forgets to delete the
>> skwish object when they have deleted the lucene record.
>>
>> on a positive note, it would shrink the index size and allow more
>> records to fit in memory.
>>
>> Regards
>> Ian
>> > On Fri, Dec 26, 2008 at 12:46 AM, Noble Paul നോബിള്‍ नोब्ळ्
>> > wrote:
>> >
>> >> The license is GPL. It cannot be used directly in any apache projects
>> >>
>> >> On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang wrote:
>> >>
>> >>>> I assume one could use Skwish instead of Lucene's normal stored fields
>> >>>> to store & retrieve document data?
>> >>>>
>> >>> Exactly: instead of storing the field's value directly in Lucene, you
>> >>> could store it in skwish and then store its skwish id in the Lucene
>> >>> field instead.  This works well for serving large streams (e.g.
>> >>> original document contents).
>> >>>
>> >>>
>> >>>> Have you run any threaded performance tests comparing the two?
>> >>>>
>> >>> No direct comps, yet.
>> >>>
>> >>> -b
>> >>>
>> >>>
>> >>> On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless
>> >>> wrote:
>> >>>
>> >>>> This looks interesting!
>> >>>> I assume one could use Skwish instead of Lucene's normal stored fields
>> >>>> to store & retrieve document data?
>> >>>> Have you run any threaded performance tests comparing the two?
>> >>>> Mike
>> >>>>
>> >>>> Babak Farhang wrote:
>> >>>>
>> >>>>> Hi everyone,
>> >>>>>
>> >>>>> I've been working on a library called Skwish to complement indexes
>> >>>>> like Lucene,  for blob storage and retrieval. This is nothing more
>> >>>>> than a structured implementation of storing all the files in one file
>> >>>>> and managing their offsets in another.  The idea is to provide a fast,
>> >>>>> concurrent, lock-free way to serve lots of files to lots of users.
>> >>>>>
>> >>>>> Hope you find it useful or interesting.
>> >>>>>
>> >>>>> -Babak
>> >>>>> http://skwish.sourceforge.net/
>> >>>>>
>> >>>>> ---------------------------------------------------------------------
>> >>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>
>> >> --
>> >> --Noble Paul
>> >>
>> >>
>> >>
>> >>
>>
>>
>
>
>
>
