Re: [Mavibot] Value storage

2013-09-16 Thread Emmanuel Lécharny
Le 9/16/13 7:59 PM, Kiran Ayyagari a écrit :
> On Mon, Sep 16, 2013 at 5:25 PM, Emmanuel Lécharny wrote:
>
>
>> IMO, I think we should have the getValues() to return an iterator, and
>> the getValue() method to returns either the single value or the first
>> value if there are more than one.
>>
>> I still prefer to get a BTree(that gives us better control if we want to
> search for a
> value in BTree)

Except that it exposes the internal representation. If we want to change
it later, that will also change the API...
> otherwise why not return a cursor? (personally, the last preference)

Cursor seems a good idea. At least it allows browing the element in both
directions.

Something to consider.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 



Re: [Mavibot] Value storage

2013-09-16 Thread Kiran Ayyagari
On Mon, Sep 16, 2013 at 5:25 PM, Emmanuel Lécharny wrote:

> Let's get back to what we need, from the BTree POV...
>
> We have a BTree storing Values associated with Keys. A given Key can be
> assoicated to one or many values, but most of the time, we will have
> only one value, and if we have many values, we will not have that many.
> But occasionally, we may have thousands of values associated with a
> single key.
>
> What happens if we have thousands of values for a given key ? It's
> become very costly to manipulate the values, and as we have to order the
> values to be able to find out if a value is already present in a decent
> delay (ie, O(log(n) if the values are ordered, compared to O(n) if they
> are not). For this very reason, we would like to use a decent way to
> store the values, and this is why we do use a sub-btree (BTree).
>
> In ApacheDS and JDBM, we have a threshold number which is used to know
> when to create a sub-btree in this case : for instance, if we have more
> than 500 values, we will use a sub-BTree, otherwise we use an array of
> values. The reason is that up to a point, it's faster to manipulate an
> array in memory than a BTree (the tiping poing though is not obvious :
> depending on the value's size, this threshold number will vary).
>
> So we want to store :
> - a single value most of the case
> - an array of values when we have more than 1 value, but below N values
> (N to be evaulated)
> - a sub-BTree when we have more than N values.
>
> Now, we can simplify this scheme in two ways :
> - we can ditch the single value, and use an array instead, and switch to
> a sub-BTree at some point
> - we can decide not to use an array at all, and use a sub-BTree when we
> ave more than one value
> - we can also always use a sub-btree when we have an index allowing
> muliple values.
>
> So far, we decided to not go with option 3, as it puts a huge strain on
> the BTree performances (the very first version was doing exactly taht,
> and using something different improved the performances by 15%).
>
> ATM, we are experimenting option 2, but we may later go for option 1.
>
> That beng said, the BTree operations which manipulate the values are the
> following :
>
> - getValue(K)
> - getValues(K)
> - insert(K, V)
> - contains(K, V)
> - delete(K, V)
>
> As we can see, we have two methods which return values, one whichs
> return one single value, and the other which returns all the values. The
> question here is why do we have two methods? That forces the users to
> know when to use one or the other, ie the users must know if the BTree
> allows multiple values.
>
> The suggestion would be to make the getValues() method to return an
> iterator. That's fine, but what about the semantic of the getValue()
> method, when we have more than one value stored ? Should it returns the
> very first value ?
>
yeap, this is what it does now, the first value will be returned if there
are more than
one value

>
> IMO, I think we should have the getValues() to return an iterator, and
> the getValue() method to returns either the single value or the first
> value if there are more than one.
>
> I still prefer to get a BTree(that gives us better control if we want to
search for a
value in BTree)
otherwise why not return a cursor? (personally, the last preference)

>
> wdyt ?
>
>
> Le 9/16/13 12:10 PM, Pierre-Arnaud Marcelot a écrit :
> > I would go with a simple Iterator.
> >
> > Regards,
> > Pierre-Arnaud
> >
> > On 16 sept. 2013, at 11:51, Emmanuel Lécharny 
> wrote:
> >
> >> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit :
> >>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny <
> elecha...@gmail.com>wrote:
> >>>
>  Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
> > On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny <
> elecha...@gmail.com
> > wrote:
> >
> >> Hi guys,
> >>
> >>
> >>/**
> >> * @return The array of stored values.
> >> */
> >>V[] getValues();
> >>
> >> shouldn't this be returning a BTree? cause we don't support an
>  array
> > as a holder of
> > multiple values
>  The fact that it's stored as a BTree is implementation dependent, I'm
>  not sure the user of the API should know about it. What the user wants
>  is to get back the stored values, and this is convenient to get back
> an
>  array.
> 
>  IMO, it is not convenient, we cannot copy all the values into an array
> >>> and return, and returning an iterator is not going to help either,
> both will
> >>> severely affect the search performance
> >>> If needed, we can wrap it an immutable structure and return the BTree
> >>> to prevent direct updates, but I wish to differ this until we see the
> impact
> >>> on partition implementation.
> >> Except that from the BTree interface, the one the users see, you have :
> >>
> >>public V get( K key ) throws IOException, KeyNotFoundException
> >>public DuplicateKeyVal getValues( K key ) throws IOE

Re: [Mavibot] Value storage

2013-09-16 Thread Emmanuel Lécharny
Let's get back to what we need, from the BTree POV...

We have a BTree storing Values associated with Keys. A given Key can be
assoicated to one or many values, but most of the time, we will have
only one value, and if we have many values, we will not have that many.
But occasionally, we may have thousands of values associated with a
single key.

What happens if we have thousands of values for a given key ? It's
become very costly to manipulate the values, and as we have to order the
values to be able to find out if a value is already present in a decent
delay (ie, O(log(n) if the values are ordered, compared to O(n) if they
are not). For this very reason, we would like to use a decent way to
store the values, and this is why we do use a sub-btree (BTree).

In ApacheDS and JDBM, we have a threshold number which is used to know
when to create a sub-btree in this case : for instance, if we have more
than 500 values, we will use a sub-BTree, otherwise we use an array of
values. The reason is that up to a point, it's faster to manipulate an
array in memory than a BTree (the tiping poing though is not obvious :
depending on the value's size, this threshold number will vary).

So we want to store :
- a single value most of the case
- an array of values when we have more than 1 value, but below N values
(N to be evaulated)
- a sub-BTree when we have more than N values.

Now, we can simplify this scheme in two ways :
- we can ditch the single value, and use an array instead, and switch to
a sub-BTree at some point
- we can decide not to use an array at all, and use a sub-BTree when we
ave more than one value
- we can also always use a sub-btree when we have an index allowing
muliple values.

So far, we decided to not go with option 3, as it puts a huge strain on
the BTree performances (the very first version was doing exactly taht,
and using something different improved the performances by 15%).

ATM, we are experimenting option 2, but we may later go for option 1.

That beng said, the BTree operations which manipulate the values are the
following :

- getValue(K)
- getValues(K)
- insert(K, V)
- contains(K, V)
- delete(K, V)

As we can see, we have two methods which return values, one whichs
return one single value, and the other which returns all the values. The
question here is why do we have two methods? That forces the users to
know when to use one or the other, ie the users must know if the BTree
allows multiple values.

The suggestion would be to make the getValues() method to return an
iterator. That's fine, but what about the semantic of the getValue()
method, when we have more than one value stored ? Should it returns the
very first value ?

IMO, I think we should have the getValues() to return an iterator, and
the getValue() method to returns either the single value or the first
value if there are more than one.


wdyt ?


Le 9/16/13 12:10 PM, Pierre-Arnaud Marcelot a écrit :
> I would go with a simple Iterator.
>
> Regards,
> Pierre-Arnaud
>
> On 16 sept. 2013, at 11:51, Emmanuel Lécharny  wrote:
>
>> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit :
>>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny 
>>> wrote:
>>>
 Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny  wrote:
>
>> Hi guys,
>>
>>
>>/**
>> * @return The array of stored values.
>> */
>>V[] getValues();
>>
>> shouldn't this be returning a BTree? cause we don't support an
 array
> as a holder of
> multiple values
 The fact that it's stored as a BTree is implementation dependent, I'm
 not sure the user of the API should know about it. What the user wants
 is to get back the stored values, and this is convenient to get back an
 array.

 IMO, it is not convenient, we cannot copy all the values into an array
>>> and return, and returning an iterator is not going to help either, both will
>>> severely affect the search performance
>>> If needed, we can wrap it an immutable structure and return the BTree
>>> to prevent direct updates, but I wish to differ this until we see the impact
>>> on partition implementation.
>> Except that from the BTree interface, the one the users see, you have :
>>
>>public V get( K key ) throws IOException, KeyNotFoundException
>>public DuplicateKeyVal getValues( K key ) throws IOException,
>> KeyNotFoundException
>>
>> so the users has no clue about the underlying data structure.
>>
>> Now, I'm not sure if it makes any sense to expose a getValues() method
>> returning an array. May be an iterator is enough..
>>
>> -- 
>> Regards,
>> Cordialement,
>> Emmanuel Lécharny
>> www.iktek.com 
>>


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 



Re: [Mavibot] Value storage

2013-09-16 Thread Pierre-Arnaud Marcelot
I would go with a simple Iterator.

Regards,
Pierre-Arnaud

On 16 sept. 2013, at 11:51, Emmanuel Lécharny  wrote:

> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit :
>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny 
>> wrote:
>> 
>>> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
 On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny >>> wrote:
 
> Hi guys,
> 
> 
>/**
> * @return The array of stored values.
> */
>V[] getValues();
> 
> shouldn't this be returning a BTree? cause we don't support an
>>> array
 as a holder of
 multiple values
>>> The fact that it's stored as a BTree is implementation dependent, I'm
>>> not sure the user of the API should know about it. What the user wants
>>> is to get back the stored values, and this is convenient to get back an
>>> array.
>>> 
>>> IMO, it is not convenient, we cannot copy all the values into an array
>> and return, and returning an iterator is not going to help either, both will
>> severely affect the search performance
>> If needed, we can wrap it an immutable structure and return the BTree
>> to prevent direct updates, but I wish to differ this until we see the impact
>> on partition implementation.
> 
> Except that from the BTree interface, the one the users see, you have :
> 
>public V get( K key ) throws IOException, KeyNotFoundException
>public DuplicateKeyVal getValues( K key ) throws IOException,
> KeyNotFoundException
> 
> so the users has no clue about the underlying data structure.
> 
> Now, I'm not sure if it makes any sense to expose a getValues() method
> returning an array. May be an iterator is enough..
> 
> -- 
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com 
> 



Re: [Mavibot] Value storage

2013-09-16 Thread Emmanuel Lécharny
Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit :
> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny wrote:
>
>> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
>>> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny >> wrote:
>>>
 Hi guys,


 /**
  * @return The array of stored values.
  */
 V[] getValues();

 shouldn't this be returning a BTree? cause we don't support an
>> array
>>> as a holder of
>>> multiple values
>> The fact that it's stored as a BTree is implementation dependent, I'm
>> not sure the user of the API should know about it. What the user wants
>> is to get back the stored values, and this is convenient to get back an
>> array.
>>
>> IMO, it is not convenient, we cannot copy all the values into an array
> and return, and returning an iterator is not going to help either, both will
> severely affect the search performance
> If needed, we can wrap it an immutable structure and return the BTree
> to prevent direct updates, but I wish to differ this until we see the impact
> on partition implementation.

Except that from the BTree interface, the one the users see, you have :

public V get( K key ) throws IOException, KeyNotFoundException
public DuplicateKeyVal getValues( K key ) throws IOException,
KeyNotFoundException

so the users has no clue about the underlying data structure.

Now, I'm not sure if it makes any sense to expose a getValues() method
returning an array. May be an iterator is enough..

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 



Re: [Mavibot] Value storage

2013-09-16 Thread Kiran Ayyagari
On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny wrote:

> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
> > On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny  >wrote:
> >
> >> Hi guys,
> >>
> >>
> >> /**
> >>  * @return The array of stored values.
> >>  */
> >> V[] getValues();
> >>
> >> shouldn't this be returning a BTree? cause we don't support an
> array
> > as a holder of
> > multiple values
>
> The fact that it's stored as a BTree is implementation dependent, I'm
> not sure the user of the API should know about it. What the user wants
> is to get back the stored values, and this is convenient to get back an
> array.
>
> IMO, it is not convenient, we cannot copy all the values into an array
and return, and returning an iterator is not going to help either, both will
severely affect the search performance
If needed, we can wrap it an immutable structure and return the BTree
to prevent direct updates, but I wish to differ this until we see the impact
on partition implementation.

> Now, it's probably even better to make the Value class being an
> Iterable, exactly like the Attribute class. To get all the values, it's
> just a mater to iterate on the Value.
>
>
> Btw, for clarity, I'll rename the classes and interfaces to Element (and
> SingleElement, BtreeElement, etc).
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>


-- 
Kiran Ayyagari
http://keydap.com


Re: [Mavibot] Value storage

2013-09-16 Thread Emmanuel Lécharny
Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny wrote:
>
>> Hi guys,
>>
>>
>> /**
>>  * @return The array of stored values.
>>  */
>> V[] getValues();
>>
>> shouldn't this be returning a BTree? cause we don't support an array
> as a holder of
> multiple values

The fact that it's stored as a BTree is implementation dependent, I'm
not sure the user of the API should know about it. What the user wants
is to get back the stored values, and this is convenient to get back an
array.

Now, it's probably even better to make the Value class being an
Iterable, exactly like the Attribute class. To get all the values, it's
just a mater to iterate on the Value.


Btw, for clarity, I'll rename the classes and interfaces to Element (and
SingleElement, BtreeElement, etc).

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 



Re: [Mavibot] Value storage

2013-09-14 Thread Kiran Ayyagari
On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny wrote:

> Hi guys,
>
> sorry for having been mute those last days, many side things to deal with.
>
> We had a big performance issue in mavibot when we store many elements in
> the btree, as we had to deserialize all the leaf's values when we update
> a leaf, which is really a bad idea. I was working on a fix for this, but
> I hit a wall : how do we manage the keys associated with more than one
> value ?
>
> Kiran has added a mechnism by which a sub-btree can be used to solve
> this case. This is fine, except that it does not work well with the new
> Holder hierarchy I'm using, so I'm now working on storing the values in
> a slight different way. Basically, all the values -either single ot
> multipke - will be stored into a encapsulating data structure. The
> ElementHolder.getValue() will return this encapsulatng objects, instead
> of the direct value.
>
> Here is the proposed interface :
>
> public interface Value
> {
> /**
>  * Tells if the value is single or contains a set of values
>  * @return True if we have one single value
>  */
> boolean isSingleValue();
>
>
> /**
>  * @return The stored value, or the first value if we have more than
> one
>  */
> V getValue();
>
>
> /**
>  * @return The array of stored values.
>  */
> V[] getValues();
>
> shouldn't this be returning a BTree? cause we don't support an array
as a holder of
multiple values

>
> /**
>  * Add a new value
>  * @param value The added value
>  */
> void addValue( V value );
>
>
> /**
>  * Delete a given value
>  * @param value The value to delete
>  */
> void deleteValue( V value );
>
>
> /**
>  * Remove all the values
>  */
> void clear();
> }
>
>
> The hierarchy will be :
>
> (Value)
>o
>|
>+-- [[AbstractValue]]
>^   ^
>|   |
>|   +-- [SingleValue]
>+-- [BTreeValue]
>
> This is a first round, I have to come with more details later.
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>


-- 
Kiran Ayyagari
http://keydap.com


[Mavibot] Value storage

2013-09-14 Thread Emmanuel Lécharny
Hi guys,

sorry for having been mute those last days, many side things to deal with.

We had a big performance issue in mavibot when we store many elements in
the btree, as we had to deserialize all the leaf's values when we update
a leaf, which is really a bad idea. I was working on a fix for this, but
I hit a wall : how do we manage the keys associated with more than one
value ?

Kiran has added a mechnism by which a sub-btree can be used to solve
this case. This is fine, except that it does not work well with the new
Holder hierarchy I'm using, so I'm now working on storing the values in
a slight different way. Basically, all the values -either single ot
multipke - will be stored into a encapsulating data structure. The
ElementHolder.getValue() will return this encapsulatng objects, instead
of the direct value.

Here is the proposed interface :

public interface Value
{
/**
 * Tells if the value is single or contains a set of values
 * @return True if we have one single value
 */
boolean isSingleValue();


/**
 * @return The stored value, or the first value if we have more than one
 */
V getValue();


/**
 * @return The array of stored values.
 */
V[] getValues();


/**
 * Add a new value
 * @param value The added value
 */
void addValue( V value );


/**
 * Delete a given value
 * @param value The value to delete
 */
void deleteValue( V value );


/**
 * Remove all the values
 */
void clear();
}


The hierarchy will be :

(Value)
   o
   |
   +-- [[AbstractValue]]
   ^   ^
   |   |
   |   +-- [SingleValue]
   +-- [BTreeValue]

This is a first round, I have to come with more details later.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com