> On Dec 14, 2015, at 7:09 PM, Till Westmann <[email protected]> wrote:
>
> On 14 Dec 2015, at 18:55, Murtadha Hubail wrote:
>
>> I think the backward compatibility discussion goes beyond metadata indexes
>> and a complete plan that considers everything in storage should be developed
>> to support upgrading and patching. Just as an example when we did the
>> repacking from edu.uci to org.apache, all existing instances on edu.uci
>> wouldn’t work on new binaries due to Java serialization on edu.uci classes.
>
> Good point. Do you know if we fixed that or did we just leave it as-is?
>
It is still as-is for the LocalResource class, but it is in my TODO queue. Even this change will break all existing instances, so the new serialization format will need a serialization version attribute to support backward compatibility.

>> Having said that, I would go with the right long term solution for metadata
>> indexes which would’ve been a result of the backward compatibility plan if
>> we had one.
>
> I tend to agree here. I think that we’ll need a backwards compatibility
> story, even if we choose to be schema-less for all metadata.
> 1) Even if the metadata is all flexible, we’ll be able to read the old
> metadata, but we’ll need to keep code around to read all versions of the
> metadata.
> 2) If we need to change the file format for the data we’ll also need a way to
> realize that (and that would probably affect the metadata as well).
>
> I think that it might be a good start to add version identifiers to persisted
> data structures, so that we’d at least be able to distinguish different
> versions (and potentially have the ability to provide some migration - if
> needed).

Agreed.

>
> Thoughts?
>
> Cheers,
> Till
>
>>> On Dec 14, 2015, at 6:19 PM, Ildar Absalyamov <[email protected]>
>>> wrote:
>>>
>>> As for general topic of backwards compatibility I think going “fully open”
>>> might be the best longterm solution.
>>> Once in a while the topic of changing metadata keeps reappearing and there
>>> is no guarantee it will not strike back again. Opening up metadata will
>>> release ourselves from burden of producing migration tools and shipping
>>> them with the new version of the binaries with revised catalog.
>>> The performance (mainly storage) impacts of that solution will be tolerable
>>> especially considering how much data is usually stored in metadata.
>>> Moreover, being big proponents of semi-structured data, it does make
>>> perfect sense for us to eat our own dog food here.
>>>
>>>> On Dec 14, 2015, at 18:04, Ildar Absalyamov <[email protected]>
>>>> wrote:
>>>>
>>>> I guess the main argument for 2 would be eliminating broken metadata
>>>> records prior to backwards compatibility cutoff.
>>>> The last thing we want to do is to be stuck with wrong implementation
>>>> for compatibility reasons. Once the functionality needed for 3 is there we
>>>> can again introduce those indexes without building sophisticated migration
>>>> subsystem.
>>>>
>>>>> On Dec 14, 2015, at 17:55, Mike Carey <[email protected]> wrote:
>>>>>
>>>>> SO - it seems like 3 is the right long-term answer, but not doable now?
>>>>> (If it was doable now, it would obviously be the ideal choice of the
>>>>> three.)
>>>>> What would be the argument for doing 2 as opposed to 1 for now?
>>>>> As for the question of backwards compatibility, I actually didn't sense a
>>>>> consensus yet.
>>>>> I would tentatively lean towards "right" over "backwards compatible" for
>>>>> this change.
>>>>> What are others' thoughts on that?
>>>>> (Soon we won't have that luxury, but right now maybe we do?)
>>>>>
>>>>> On 12/14/15 3:43 PM, Steven Jacobs wrote:
>>>>>> We just had a UCR discussion on this topic. The issue is really with the
>>>>>> third "index" here. The code now is using one "index" to go in two
>>>>>> directions:
>>>>>> 1) To find datatypes that use datatype A
>>>>>> 2) To find datatypes that are used by datatype A.
>>>>>>
>>>>>> The way that it works now is hacked together, but designed for
>>>>>> performance.
>>>>>> So we have three choices here:
>>>>>>
>>>>>> 1) Stick to the status quo, and leave the "indexes" as they are
>>>>>> 2) Remove the Metadata secondary indexes, which will eliminate the hack but
>>>>>> cost some performance on Metadata
>>>>>> 3) Implement the Metadata secondary indexes correctly as Asterix indexes.
>>>>>> For this solution to work with our dataset designs, we will need to have
>>>>>> the ability to index homogeneous lists. In addition, we will have reverse
>>>>>> compatibility issues unless we plan things out for the transition.
>>>>>>
>>>>>> What are the thoughts?
>>>>>>
>>>>>>
>>>>>> Orthogonally, it seems that the consensus for storing the datatype
>>>>>> dataverse in the dataset Metadata is to just add it as an open field at
>>>>>> least for now. Is that correct?
>>>>>>
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 14, 2015 at 1:23 PM, Mike Carey <[email protected]> wrote:
>>>>>>
>>>>>>> Thoughts inlined:
>>>>>>>
>>>>>>> On 12/14/15 11:12 AM, Steven Jacobs wrote:
>>>>>>>
>>>>>>>> Here are the conclusions that Ildar and I have drawn from looking at the
>>>>>>>> secondary indexes:
>>>>>>>>
>>>>>>>> First of all it seems that datasets are local to node groups, but
>>>>>>>> dataverses can span node groups, which seems a little odd to me.
>>>>>>>>
>>>>>>> Node groups are an undocumented but to-be-exploited-someday feature that
>>>>>>> allows datasets to be stored on less than all nodes in a given cluster. As
>>>>>>> we face bigger clusters, we'll want to open up that possibility. We will
>>>>>>> hopefully use them inside w/o having to make users manage them manually
>>>>>>> like parallel DB2 did/does. Dataverses are really just a namespace thing,
>>>>>>> not a storage thing at all, so they are orthogonal to (and unrelated to)
>>>>>>> node groups.
>>>>>>>
>>>>>>>> There are three Metadata secondary indexes:
>>>>>>>> GROUPNAME_ON_DATASET_INDEX,
>>>>>>>> DATATYPENAME_ON_DATASET_INDEX, DATATYPENAME_ON_DATATYPE_INDEX
>>>>>>>>
>>>>>>>> The first is used in only one case:
>>>>>>>> When dropping a node group, check if there are any datasets using this node
>>>>>>>> group. If so, don't allow the drop
>>>>>>>> BUT, this index has a field called "dataverse" which is not used at all.
>>>>>>>>
>>>>>>> This one seems like a waste of space since we do this almost never. (Not
>>>>>>> much space, but unnecessary.) If we keep it, it should become a proper
>>>>>>> index.
>>>>>>>
>>>>>>>> The second is used when dropping a datatype. If there is a dataset using
>>>>>>>> this datatype, don't allow the drop.
>>>>>>>> Similarly, this index has a "dataverse" which is never used.
>>>>>>>>
>>>>>>> You're about to use the dataverse part, right? :-) This index seems like
>>>>>>> it will be useful but should be a proper index.
>>>>>>>
>>>>>>>> The third index is used to go in two cases, using two different ideas of
>>>>>>>> "keys"
>>>>>>>> It seems like this should actually be two different indexes.
>>>>>>>>
>>>>>>> I don't think I understood this comment....
>>>>>>>
>>>>>>>
>>>>>>>> This is my understanding so far. It would be good to discuss what the
>>>>>>>> "correct" version should be.
>>>>>>>> Steven
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 14, 2015 at 10:12 AM, Steven Jacobs <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>> I'm implementing a change so that datasets can use datatypes from
>>>>>>>>> alternate dataverses (previously the type and set had to be from the same
>>>>>>>>> dataverse). Unfortunately this means another change for Dataset Metadata
>>>>>>>>> (which will now store the dataverse for its type).
>>>>>>>>>
>>>>>>>>> As such, I had a couple of questions:
>>>>>>>>>
>>>>>>>>> 1) Should this change be thrown into the release branch, as it is another
>>>>>>>>> Metadata change?
>>>>>>>>>
>>>>>>>>> 2) In implementing this change, I've been looking at the Metadata
>>>>>>>>> secondary indexes. I had a discussion with Ildar, and it seems the thread
>>>>>>>>> on Metadata secondary indexes being "hacked" has been lost. Is this also
>>>>>>>>> something that should get into the release? Is there anyone currently
>>>>>>>>> looking at it?
>>>>>>>>>
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>>> Best regards,
>>>> Ildar
>>>>
>>>
>>> Best regards,
>>> Ildar
>>>
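
To make the "version identifier on persisted data structures" idea from this thread a bit more concrete, here is a minimal sketch. It is only an illustration under assumptions, not AsterixDB code: the class `VersionedResourceDemo`, the record `ResourceInfo`, its fields, and the `"Default"` fallback are all hypothetical. It also swaps default Java object serialization (which ties the bytes to class names such as edu.uci vs. org.apache) for an explicit `DataOutputStream` layout whose first field is a format version, so that a reader can still decode records written by older binaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in AsterixDB.
class VersionedResourceDemo {
    static final int VERSION_1 = 1; // layout: version, id, name
    static final int VERSION_2 = 2; // layout: version, id, name, dataverse
    static final String DEFAULT_DATAVERSE = "Default"; // assumed fallback for old records

    static final class ResourceInfo {
        final long id;
        final String name;
        final String dataverse;

        ResourceInfo(long id, String name, String dataverse) {
            this.id = id;
            this.name = name;
            this.dataverse = dataverse;
        }
    }

    // Writes the current (version 2) layout; the version always comes first.
    static byte[] write(ResourceInfo r) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(VERSION_2);
            out.writeLong(r.id);
            out.writeUTF(r.name);
            out.writeUTF(r.dataverse);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulates a record persisted by older binaries (version 1 layout).
    static byte[] writeV1(long id, String name) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(VERSION_1);
            out.writeLong(id);
            out.writeUTF(name);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads either layout by checking the version before decoding the rest.
    static ResourceInfo read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int version = in.readInt();
            if (version != VERSION_1 && version != VERSION_2) {
                throw new IllegalStateException("Unknown format version: " + version);
            }
            long id = in.readLong();
            String name = in.readUTF();
            // Version 1 records have no dataverse field, so fill in a default.
            String dataverse = version >= VERSION_2 ? in.readUTF() : DEFAULT_DATAVERSE;
            return new ResourceInfo(id, name, dataverse);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceInfo old = read(writeV1(1L, "Tweets"));
        ResourceInfo cur = read(write(new ResourceInfo(2L, "Users", "Social")));
        System.out.println("v1 record dataverse: " + old.dataverse);
        System.out.println("v2 record dataverse: " + cur.dataverse);
    }
}
```

The point of the sketch is only that the reader keeps one branch per known version, which is the "code around to read all versions" cost mentioned above; an actual migration tool would read with the old branch and rewrite with the current one.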
