[ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045576#comment-15045576 ]
Allen Wittenauer commented on HADOOP-12620:
-------------------------------------------
Given that every update to this JIRA is going to send the body of the description out, can we move the majority of the body out to a comment or an attachment? Thanks.

> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
>                 Key: HADOOP-12620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12620
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Dinesh S. Atreya
>
> h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)
>
> One main motivation for this JIRA is to address a comprehensive set of uses with just minimal enhancements to Hadoop, transitioning Hadoop from a Modern Data Architecture to an Advanced/Cloud Data Architecture.
>
> HDFS traditionally had a write-once-read-many access model for files until the introduction of the "Append to files in HDFS" capability. The next minimal enhancement to core Hadoop is the capability to do "updates-in-place" in HDFS:
> • Support seeks for writes (in addition to reads).
> • After a seek, if the new byte length is the same as the old byte length, an in-place update is allowed.
> • A delete is an update with an appropriate delete marker.
> • If the byte length differs, the old entry is marked as deleted and the new one is appended as before.
> • It is the client's discretion to perform update, append, or both; the API changes in the different Hadoop components should provide these capabilities.
>
> These minimal changes lay the basis for transforming core Hadoop into an interactive and real-time platform and introduce significant native capabilities. They lay a foundation for all of the following processing styles to be supported natively and dynamically:
> • Real time
> • Mini-batch
> • Stream-based data processing
> • Batch – which is the default now.
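> The update rule in the bullets above can be sketched in plain Java. This is only an illustration of the proposed semantics, not a real HDFS API (seek-for-writes does not exist yet): a growable byte array stands in for a file, and the '#' tombstone byte is an invented placeholder for the delete marker.
>
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
>
> public class InPlaceUpdateSketch {
>
>     /** Outcome of attempting an update, per the proposal. */
>     enum Outcome { UPDATED_IN_PLACE, MARKED_DELETED_AND_APPENDED }
>
>     /** Toy append-only store; a byte[] stands in for an HDFS file. */
>     static byte[] store = new byte[0];
>
>     /** Append a record and return its offset, as HDFS allows today. */
>     static int append(byte[] record) {
>         int offset = store.length;
>         store = Arrays.copyOf(store, store.length + record.length);
>         System.arraycopy(record, 0, store, offset, record.length);
>         return offset;
>     }
>
>     /**
>      * "Seek" to offset and update oldLen bytes. Same length: overwrite
>      * in place. Different length: overwrite the old bytes with a delete
>      * marker and append the replacement, as the proposal describes.
>      */
>     static Outcome update(int offset, int oldLen, byte[] newRecord) {
>         if (newRecord.length == oldLen) {
>             System.arraycopy(newRecord, 0, store, offset, oldLen);
>             return Outcome.UPDATED_IN_PLACE;
>         }
>         Arrays.fill(store, offset, offset + oldLen, (byte) '#'); // tombstone
>         append(newRecord);
>         return Outcome.MARKED_DELETED_AND_APPENDED;
>     }
>
>     public static void main(String[] args) {
>         int off = append("hello".getBytes(StandardCharsets.UTF_8));
>         Outcome same = update(off, 5, "HELLO".getBytes(StandardCharsets.UTF_8));
>         Outcome longer = update(off, 5, "HELLO WORLD".getBytes(StandardCharsets.UTF_8));
>         System.out.println(same + " then " + longer);
>     }
> }
> {code}
>
> Note how the same-length restriction keeps every write either a pure overwrite or a pure append, so existing readers never see a record change size underneath them.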
> Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, enhancing or replacing prevailing approaches.
>
> With this, Hadoop engines can evolve to utilize modern CPU, memory, and I/O resources with increasing efficiency. The Hadoop task engines can use vectorized/pipelined processing and make greater use of memory throughout the Hadoop platform.
>
> These changes enable enhanced performance optimizations to be implemented in HDFS and made available to all Hadoop components. This enables fast processing of Big Data across all of its characteristics: volume, velocity, and variety.
>
> There are many influences for this umbrella JIRA:
> • Preserve and accelerate Hadoop
> • Efficient data management of a variety of data formats natively in Hadoop
> • Enterprise expansion
> • Internet and media
> • Databases offer native support for a variety of data formats such as JSON, XML, indexes, temporal data, etc. – Hadoop should do the same.
>
> It is quite probable that many sub-JIRAs will be created to address portions of this. This JIRA captures a variety of use-cases in one place. Some initial data management/platform use-cases are given hereunder.
>
> h2. Key-Value Store
> With the proposed enhancements, it will become very convenient to implement a key-value store natively in Hadoop.
>
> h2. MVCC
> A modified example of how MVCC can be implemented with the proposed enhancements, adapted from PostgreSQL's MVCC, is given hereunder:
> https://wiki.postgresql.org/wiki/MVCC
> http://momjian.us/main/writings/pgsql/mvcc.pdf
>
> || Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
> | 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null; to keep the update size constant, MAX_VAL is pre-seeded for our purposes. |
> | 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
> | 2 | Update (old delete) | 64 | 78 | Old data marked as deleted. |
> | 2 | Update (new insert) | 78 | MAX_VAL | New data inserted. |
>
> h2. Graph Stores
> Enable native storage and processing for a variety of graph stores.
>
> Graph Store 1 (Spark GraphX):
> 1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge data. Each edge is represented as a tuple consisting of the source vertex id, destination vertex id, and user-defined data, as well as a virtual partition identifier (pid). Note that the edge table contains only the vertex ids and not the vertex data. The edge table is partitioned by pid.
> 2. VertexDataTable(id, data): stores the vertex data in the form of (id, data) pairs. The vertex data table is indexed and partitioned by the vertex id.
> 3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids of the virtual partitions that contain adjacent edges.
>
> Graph Store 2 (Facebook Social Graph - TAO):
> Object: (id) → (otype, (key → value)*)
> Assoc.: (id1, atype, id2) → (time, (key → value)*)
>
> h2. WEB
> With the AHA enhancements, a variety of Web standards can be natively supported, such as updateable JSON (http://json.org/), XML, RDF, and other documents.
>
> RDF
> RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
> RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/
> The simplest triple statement is a sequence of (subject, predicate, object) terms, separated by whitespace and terminated by '.' after each triple.
>
> h2. Mobile Apps Data and Resources
> With the proposed enhancements, in addition to the Web, app data and resources can also be managed using Hadoop. Examples of such usage include app data and resources for Apple and other app stores.
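> The counter-based MVCC scheme in the table earlier reduces to one visibility rule, sketched below under an assumption borrowed from PostgreSQL-style MVCC: a row version is visible at counter c when create <= c < expiry, with Long.MAX_VALUE standing in for the pre-seeded MAX_VAL so that expiring a version is a fixed-width in-place update.
>
> {code:java}
> public class MvccSketch {
>     static final long MAX_VAL = Long.MAX_VALUE; // pre-seeded "not expired"
>
>     /** One row version carrying the create/expiry counters from the table. */
>     static final class Version {
>         final long create;   // counter at insert time
>         long expiry;         // rewritten in place on delete; same byte width
>         Version(long create) { this.create = create; this.expiry = MAX_VAL; }
>     }
>
>     /** Visible at counter c iff created at or before c and not yet expired. */
>     static boolean visible(Version v, long c) {
>         return v.create <= c && c < v.expiry;
>     }
>
>     public static void main(String[] args) {
>         Version v = new Version(40);        // Insert at counter 40
>         System.out.println(visible(v, 45)); // true
>         v.expiry = 47;                      // Delete marked at counter 47
>         System.out.println(visible(v, 45)); // still true: snapshot predates delete
>         System.out.println(visible(v, 50)); // false: expired at 47
>     }
> }
> {code}
>
> Because a delete only rewrites the expiry counter in place, older snapshots keep reading the version unchanged – exactly the property the same-byte-length update rule is meant to guarantee.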
> About App Resources:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
> On-Demand Resources Essentials:
> https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
> Resource Programming Guide:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
>
> h2. Temporal Data
> https://en.wikipedia.org/wiki/Temporal_database
> https://en.wikipedia.org/wiki/Valid_time
> In temporal databases, rows may be updated to reflect corrections to previously recorded facts. For example, the data changes from
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
> to
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
> Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
> Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
>
> h2. Media
> Media production typically involves many changes and updates prior to release. The enhancements will lay a basis for the full lifecycle to be managed in the Hadoop ecosystem.
>
> h2. Indexes
> With these changes, a variety of updatable indexes can be supported natively in Hadoop. Search software such as Solr, Elasticsearch, etc. can then in turn leverage Hadoop's enhanced native capabilities.
>
> h2. Natural Support for ETL and Analytics
> With native support for updates and deletes in addition to appends/inserts, Hadoop will have proper and natural support for ETL and analytics.
>
> h2. Google References
> While Google's research in this area is interesting (and some extracts are listed hereunder), the evolution of Hadoop is equally interesting. The proposed enhancements to support in-place update in core Hadoop will enable, and make easier, a variety of enhancements for each of the Hadoop components.
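> The temporal correction shown in the Temporal Data example above – John Doe's Bigtown period being split when the Beachy residence is learned – is a valid-time interval split. A toy sketch (not any particular temporal-SQL API; the Period type and split helper are invented for illustration):
>
> {code:java}
> import java.time.LocalDate;
> import java.util.ArrayList;
> import java.util.List;
>
> public class ValidTimeSketch {
>
>     /** One valid-time row: city held over the half-open period [from, to). */
>     static final class Period {
>         final String city; final LocalDate from; final LocalDate to;
>         Period(String city, LocalDate from, LocalDate to) {
>             this.city = city; this.from = from; this.to = to;
>         }
>         @Override public String toString() {
>             return "Person(John Doe, " + city + ", " + from + ", " + to + ")";
>         }
>     }
>
>     /** Replace the slice [from, to) of an enclosing period with a corrected city. */
>     static List<Period> split(Period old, String newCity, LocalDate from, LocalDate to) {
>         List<Period> out = new ArrayList<>();
>         if (old.from.isBefore(from)) out.add(new Period(old.city, old.from, from));
>         out.add(new Period(newCity, from, to));
>         if (to.isBefore(old.to)) out.add(new Period(old.city, to, old.to));
>         return out;
>     }
>
>     public static void main(String[] args) {
>         Period bigtown = new Period("Bigtown",
>                 LocalDate.of(1994, 8, 26), LocalDate.of(2001, 4, 1));
>         // Correction: John Doe actually lived in Beachy 1-Jun-1995 .. 3-Sep-2000.
>         for (Period p : split(bigtown, "Beachy",
>                 LocalDate.of(1995, 6, 1), LocalDate.of(2000, 9, 3))) {
>             System.out.println(p);
>         }
>     }
> }
> {code}
>
> The one Bigtown row becomes three rows, matching the "after" state in the example; only the affected interval is rewritten, the Smallville row is untouched.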
> We propose a basis for a system that incrementally processes updates to large data sets, reducing the overhead of always having to run large batches. Hadoop engines can dynamically choose which processing style to use based on the type and volume of the data sets, enhancing or replacing prevailing approaches.
>
> • 2015: Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform
> http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html
> https://cloud.google.com/bigtable/
> • 2014: Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
> http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf
> • 2013: F1: A Distributed SQL Database That Scales
> http://research.google.com/pubs/pub41344.html
> • 2013: Online, Asynchronous Schema Change in F1
> http://research.google.com/pubs/pub41376.html
> • 2013: Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
> http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf
> • 2012: F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business
> http://research.google.com/pubs/pub38125.html
> • 2012: Spanner: Google's Globally-Distributed Database
> http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf
> • 2012: Clydesdale: Structured Data Processing on MapReduce
> http://dl.acm.org/citation.cfm?doid=2247596.2247600
> • 2011: Megastore: Providing Scalable, Highly Available Storage for Interactive Services
> http://research.google.com/pubs/pub36971.html
> • 2011: Tenzing: A SQL Implementation on the MapReduce Framework
> http://research.google.com/pubs/pub37200.html
> • 2010: Dremel: Interactive Analysis of Web-Scale Datasets
> http://research.google.com/pubs/pub36632.html
> • 2010: FlumeJava: Easy, Efficient Data-Parallel Pipelines
> http://research.google.com/pubs/pub35650.html
> • 2010: Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications
> http://research.google.com/pubs/pub36726.html
> https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf
>
> h2. Application Domains
> The enhancements will lay a path for comprehensive support of all application domains in Hadoop. A small collection is given hereunder:
> • Data warehousing and enhanced ETL processing
> • Supply chain planning
> • Web sites
> • Mobile app stores
> • Financials
> • Media
> • Machine learning
> • Social media
> • Enterprise applications such as ERP and CRM
>
> Corresponding umbrella JIRAs can be found for each of the Hadoop platform components.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)