On Thu, Apr 5, 2018 at 10:22 PM, Hanumath Rao Maduri <hanu....@gmail.com> wrote:
> ...
>
> Thank you, Ted, for your valuable suggestions. Regarding your comment that
> "metastore is good but centralized is bad," can you please share your
> viewpoint on what design issues it can cause? I know that it can be a
> bottleneck, but I want to know about the other issues. Put another way: if a
> centralized metastore were engineered well enough to avoid most of the
> bottlenecks, do you think it would be good to use for metadata?

Centralized metadata stores have caused the following problems in my experience:

1) They lock versions and make it extremely hard to upgrade applications incrementally. It is a common fiction that all applications using the same data can be upgraded at the same moment. It isn't acceptable to require an outage and force an upgrade on users, and it also isn't acceptable to freeze the metadata store so that it is never updated.

2) They go down and take everything else with them.

3) They require elaborate caching. The error message "updating metadata cache" was the most common string on the Impala mailing list for a long time because of the 30-minute delays that customers were seeing due to this kind of problem.

4) They limit expressivity. Because it is hard to update a metadata store safely, they evolve slowly and typically don't describe new data well. That is why the Hive metastore doesn't deal with variably typed or structured data worth a darn, and the same thing will happen with any new centralized metadata store.

5) They inhibit multi-tenancy. Ideally, data describes itself so that different users can see the same data even if they are not nominally part of the same org or sub-org.

6) They inhibit data fabrics that extend beyond a single cluster. Centralized metadata stores are inherently anti-global. Self-describing data, on the other hand, is inherently global: wherever the data goes, so goes the metadata. Note that self-describing data does not have to be intrinsically self-descriptive in a single file.
I view a JSON file with a schema file alongside it as a self-describing pair. As an example, imagine that file extensions were tied to applications by a central authority (a metadata store). That would mean you couldn't change web browsers (.html) or spreadsheets. Or compilers. And frankly, the fact that my computer has a single idea about how a file is interpreted is limiting. I would prefer to use Photoshop on images in certain directories and Preview for images elsewhere. A single repository linking file type to application is too limiting even on my laptop. That is ultimately the same issue as a centralized metadata store, except that my issues with images are tiny compared to the problems that occur when 5,000 analysts working on the same data all get screwed by a single broken piece of software.
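To make the sidecar-schema idea concrete, here is a minimal sketch in Python. The schema convention, field names, and type vocabulary are all made up for illustration (this is not any particular tool's format); the point is only that a consumer needs nothing beyond the data file and its companion schema to interpret the records:

```python
import json

# Hypothetical convention: data.json travels alongside data.schema.json,
# so any consumer can interpret the records without asking a central
# metadata store. Both shown inline here for a self-contained example.
schema = json.loads('{"fields": {"name": "string", "age": "int"}}')
record = json.loads('{"name": "Alice", "age": 42}')

# Map the (invented) schema type names onto Python types.
PY_TYPES = {"string": str, "int": int}

def conforms(rec, schema):
    """Check that every field declared in the sidecar schema is present
    in the record with the declared type."""
    return all(
        field in rec and isinstance(rec[field], PY_TYPES[ftype])
        for field, ftype in schema["fields"].items()
    )

print(conforms(record, schema))                # True
print(conforms({"name": "Bob"}, schema))      # False: "age" is missing
```

Because the schema is just another file next to the data, it is copied, versioned, and replicated along with the data itself, which is what makes the pair "global" in the sense above.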