Re: Metadata changes

Murtadha Hubail Mon, 14 Dec 2015 19:31:16 -0800

> On Dec 14, 2015, at 7:09 PM, Till Westmann <[email protected]> wrote:
> 
> On 14 Dec 2015, at 18:55, Murtadha Hubail wrote:
> 
>> I think the backward compatibility discussion goes beyond metadata indexes, 
>> and a complete plan that considers everything in storage should be developed 
>> to support upgrading and patching. Just as an example, when we did the 
>> repackaging from edu.uci to org.apache, all existing instances on edu.uci 
>> wouldn’t work on new binaries due to Java serialization of edu.uci classes.
> 
> Good point. Do you know if we fixed that or did we just leave it as-is?
>


It is still as-is for the LocalResource class, but fixing it is in my TODO 
queue. Even that fix will break all existing instances, and it will require 
adding a version attribute to the new serialization format to support 
backward compatibility.
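
To make the idea of a version attribute concrete, here is a minimal sketch in Python (used only for brevity; this is not the LocalResource code). The magic bytes, header layout, JSON payload, and function names are all assumptions made for illustration.

```python
import json
import struct
from typing import Dict, Tuple

# Hypothetical layout: 4-byte magic, 2-byte big-endian format version, payload.
MAGIC = b"LRES"
HEADER = struct.Struct(">4sH")
CURRENT_VERSION = 2


def encode_resource(resource: Dict) -> bytes:
    """Serialize a resource with an explicit format version in the header."""
    payload = json.dumps(resource, sort_keys=True).encode("utf-8")
    return HEADER.pack(MAGIC, CURRENT_VERSION) + payload


def decode_resource(blob: bytes) -> Tuple[int, Dict]:
    """Return (version, resource); reject data this binary cannot interpret."""
    if len(blob) < HEADER.size:
        raise ValueError("blob too short to contain a version header")
    magic, version = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("no version header found (legacy or foreign format)")
    if version > CURRENT_VERSION:
        raise ValueError(f"format version {version} is newer than this binary")
    return version, json.loads(blob[HEADER.size:].decode("utf-8"))
```

Because the version is read before the payload is interpreted, a newer binary can decide how to parse older data instead of failing inside class-based deserialization.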

>> Having said that, I would go with the right long term solution for metadata 
>> indexes which would’ve been a result of the backward compatibility plan if 
>> we had one.
> 
> I tend to agree here. I think that we’ll need a backwards compatibility 
> story, even if we choose to be schema-less for all metadata.
> 1) Even if the metadata is all flexible, we’ll be able to read the old 
> metadata, but we’ll need to keep code around to read all versions of the 
> metadata.
> 2) If we need to change the file format for the data we’ll also need a way to 
> realize that (and that would probably affect the metadata as well).
> 
> I think that it might be a good start to add version identifiers to persisted 
> data structures, so that we’d at least be able to distinguish different 
> versions (and potentially have the ability to provide some migration - if 
> needed).

Agreed.
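
As a rough illustration of how version identifiers could drive migration, here is a sketch in Python. The record field names (FormatVersion, DatatypeDataverseName, and so on) and the version numbers are hypothetical, and the single migration step simply mirrors the dataset change that started this thread, where the type's dataverse used to be implied by the dataset's own dataverse.

```python
from typing import Callable, Dict

# Each entry upgrades a record from version N to version N + 1.
# Field names and version numbers are hypothetical.
MIGRATIONS: Dict[int, Callable[[dict], dict]] = {}
LATEST_VERSION = 2


def migration(from_version: int):
    """Register a function that upgrades records written at from_version."""
    def register(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return register


@migration(1)
def add_type_dataverse(record: dict) -> dict:
    # Version 2 stores the dataverse of the dataset's type explicitly;
    # version 1 implied that it was the dataset's own dataverse.
    upgraded = dict(record)
    upgraded["DatatypeDataverseName"] = record["DataverseName"]
    return upgraded


def upgrade(record: dict) -> dict:
    """Apply migrations one step at a time until the record is current."""
    # Records without a version identifier are treated as version 1.
    version = record.get("FormatVersion", 1)
    if version > LATEST_VERSION:
        raise ValueError(f"record version {version} is newer than this binary")
    current = dict(record)
    while version < LATEST_VERSION:
        step = MIGRATIONS.get(version)
        if step is None:
            raise ValueError(f"no migration registered from version {version}")
        current = step(current)
        version += 1
        current["FormatVersion"] = version
    return current
```

With this shape, reading an old record is a matter of calling `upgrade` on it, and each future format change adds one registered step rather than a separate migration tool.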

> 
> Thoughts?
> 
> Cheers,
> Till
> 
>>> On Dec 14, 2015, at 6:19 PM, Ildar Absalyamov <[email protected]> wrote:
>>> 
>>> As for the general topic of backwards compatibility, I think going “fully 
>>> open” might be the best long-term solution.
>>> Once in a while the topic of changing metadata reappears, and there 
>>> is no guarantee it will not strike back again. Opening up metadata will 
>>> relieve us of the burden of producing migration tools and shipping 
>>> them with the new version of the binaries with a revised catalog.
>>> The performance (mainly storage) impact of that solution will be tolerable, 
>>> especially considering how much data is usually stored in metadata.
>>> Moreover, being big proponents of semi-structured data, it does make 
>>> perfect sense for us to eat our own dog food here.
>>> 
>>>> On Dec 14, 2015, at 18:04, Ildar Absalyamov <[email protected]> wrote:
>>>> 
>>>> I guess the main argument for 2 would be eliminating broken metadata 
>>>> records prior to the backwards compatibility cutoff.
>>>> The last thing we want is to be stuck with the wrong implementation 
>>>> for compatibility reasons. Once the functionality needed for 3 is there, we 
>>>> can introduce those indexes again without building a sophisticated 
>>>> migration subsystem.
>>>> 
>>>>> On Dec 14, 2015, at 17:55, Mike Carey <[email protected]> wrote:
>>>>> 
>>>>> SO - it seems like 3 is the right long-term answer, but not doable now?
>>>>> (If it was doable now, it would obviously be the ideal choice of the 
>>>>> three.)
>>>>> What would be the argument for doing 2 as opposed to 1 for now?
>>>>> As for the question of backwards compatibility, I actually didn't sense a 
>>>>> consensus yet.
>>>>> I would tentatively lean towards "right" over "backwards compatible" for 
>>>>> this change.
>>>>> What are others' thoughts on that?
>>>>> (Soon we won't have that luxury, but right now maybe we do?)
>>>>> 
>>>>> On 12/14/15 3:43 PM, Steven Jacobs wrote:
>>>>>> We just had a UCR discussion on this topic. The issue is really with the
>>>>>> third "index" here. The code now is using one "index" to go in two
>>>>>> directions:
>>>>>> 1) To find datatypes that use datatype A
>>>>>> 2) To find datatypes that are used by datatype A.
>>>>>> 
>>>>>> The way that it works now is hacked together, but designed for 
>>>>>> performance.
>>>>>> So we have three choices here:
>>>>>> 
>>>>>> 1) Stick to the status quo, and leave the "indexes" as they are
>>>>>> 2) Remove the Metadata secondary indexes, which will eliminate the hack
>>>>>> but cost some performance on Metadata
>>>>>> 3) Implement the Metadata secondary indexes correctly as Asterix indexes.
>>>>>> For this solution to work with our dataset designs, we will need to have
>>>>>> the ability to index homogeneous lists. In addition, we will have reverse
>>>>>> compatibility issues unless we plan things out for the transition.
>>>>>> 
>>>>>> What are the thoughts?
>>>>>> 
>>>>>> 
>>>>>> Orthogonally, it seems that the consensus for storing the datatype
>>>>>> dataverse in the dataset Metadata is to just add it as an open field at
>>>>>> least for now. Is that correct?
>>>>>> 
>>>>>> Steven
>>>>>> 
>>>>>> 
>>>>>> On Mon, Dec 14, 2015 at 1:23 PM, Mike Carey <[email protected]> wrote:
>>>>>> 
>>>>>>> Thoughts inlined:
>>>>>>> 
>>>>>>> On 12/14/15 11:12 AM, Steven Jacobs wrote:
>>>>>>> 
>>>>>>>> Here are the conclusions that Ildar and I have drawn from looking at
>>>>>>>> the secondary indexes:
>>>>>>>> 
>>>>>>>> First of all it seems that datasets are local to node groups, but
>>>>>>>> dataverses can span node groups, which seems a little odd to me.
>>>>>>>> 
>>>>>>> Node groups are an undocumented but to-be-exploited-someday feature that
>>>>>>> allows datasets to be stored on less than all nodes in a given cluster.
>>>>>>> As we face bigger clusters, we'll want to open up that possibility. We
>>>>>>> will hopefully use them inside w/o having to make users manage them
>>>>>>> manually like parallel DB2 did/does. Dataverses are really just a
>>>>>>> namespace thing, not a storage thing at all, so they are orthogonal to
>>>>>>> (and unrelated to) node groups.
>>>>>>> 
>>>>>>>> There are three Metadata secondary indexes:
>>>>>>>> GROUPNAME_ON_DATASET_INDEX, DATATYPENAME_ON_DATASET_INDEX,
>>>>>>>> DATATYPENAME_ON_DATATYPE_INDEX
>>>>>>>> 
>>>>>>>> The first is used in only one case:
>>>>>>>> When dropping a node group, check if there are any datasets using this
>>>>>>>> node group. If so, don't allow the drop.
>>>>>>>> BUT, this index has a field called "dataverse" which is not used at all.
>>>>>>>> 
>>>>>>> This one seems like a waste of space since we do this almost never. (Not
>>>>>>> much space, but unnecessary.)  If we keep it it should become a proper
>>>>>>> index.
>>>>>>> 
>>>>>>>> The second is used when dropping a datatype. If there is a dataset
>>>>>>>> using this datatype, don't allow the drop.
>>>>>>>> Similarly, this index has a "dataverse" which is never used.
>>>>>>>> 
>>>>>>> You're about to use the dataverse part, right?  :-)  This index seems
>>>>>>> like it will be useful but should be a proper index.
>>>>>>> 
>>>>>>>> The third index is used in two cases, using two different ideas of
>>>>>>>> "keys".
>>>>>>>> It seems like this should actually be two different indexes.
>>>>>>>> 
>>>>>>> I don't think I understood this comment....
>>>>>>> 
>>>>>>> 
>>>>>>>> This is my understanding so far. It would be good to discuss what the
>>>>>>>> "correct" version should be.
>>>>>>>> Steven
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Dec 14, 2015 at 10:12 AM, Steven Jacobs <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> I'm implementing a change so that datasets can use datatypes from
>>>>>>>>> alternate dataverses (previously the type and set had to be from the
>>>>>>>>> same dataverse). Unfortunately this means another change for Dataset
>>>>>>>>> Metadata (which will now store the dataverse for its type).
>>>>>>>>> 
>>>>>>>>> As such, I had a couple of questions:
>>>>>>>>> 
>>>>>>>>> 1) Should this change be thrown into the release branch, as it is
>>>>>>>>> another Metadata change?
>>>>>>>>> 
>>>>>>>>> 2) In implementing this change, I've been looking at the Metadata
>>>>>>>>> secondary indexes. I had a discussion with Ildar, and it seems the
>>>>>>>>> thread on Metadata secondary indexes being "hacked" has been lost. Is
>>>>>>>>> this also something that should get into the release? Is there anyone
>>>>>>>>> currently looking at it?
>>>>>>>>> 
>>>>>>>>> Steven
>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>>>> 
>>>> Best regards,
>>>> Ildar
>>>> 
>>> 
>>> Best regards,
>>> Ildar
>>>
