[ https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dinesh S. Atreya updated HADOOP-12620:
--------------------------------------
    Description:
h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)

One main motivation for this JIRA is to address a comprehensive set of use cases with minimal enhancements to Hadoop, transitioning Hadoop from a Modern Data Architecture to an Advanced/Cloud Data Architecture.

HDFS traditionally had a write-once-read-many access model for files, until the introduction of the “Append to files in HDFS” capability. The next minimal enhancement to core Hadoop is the capability to do “updates-in-place” in HDFS:
• Support seeks for writes (in addition to reads).
• After a seek, if the new byte length is the same as the old byte length, an in-place update is allowed.
• A delete is an update with an appropriate Delete marker.
• If the byte length is different, the old entry is marked as deleted and the new one is appended as before.
• It is at the client’s discretion to perform an update, an append, or both, and the API changes in the different Hadoop components should provide these capabilities.

These minimal changes lay the basis for transforming core Hadoop into an interactive and real-time platform and for introducing significant native capabilities to Hadoop. They lay a foundation for all of the following processing styles to be supported natively and dynamically:
• Real time
• Mini-batch
• Stream based data processing
• Batch – which is the default now.

Hadoop engines can dynamically choose the processing style to use based on the type and volume of the data sets, and enhance or replace prevailing approaches.

With this, Hadoop engines can evolve to utilize modern CPU, memory and I/O resources with increasing efficiency. The Hadoop task engines can use vectorized/pipelined processing and make greater use of memory throughout the Hadoop platform.

These will enable enhanced performance optimizations to be implemented in HDFS and made available to all the Hadoop components.
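The update rules listed in the bullets above can be sketched on a small scale. The following is only an illustration under stated assumptions: it uses an in-memory seekable stream as a stand-in for an HDFS file (HDFS itself does not offer seek-for-write today, which is what this JIRA proposes), and the record layout, the marker values and the `RecordFile` class are hypothetical names invented for this sketch, not part of any Hadoop API.

```python
import io

# Hypothetical one-byte markers; the proposal only says "appropriate Delete marker".
LIVE = 0x01
DELETED = 0x00
HEADER_SIZE = 5  # 1 marker byte + 4-byte big-endian payload length


class RecordFile:
    """Toy record store over a seekable binary stream (stand-in for an HDFS file)."""

    def __init__(self, stream):
        self.stream = stream

    def append(self, payload: bytes) -> int:
        """Append a live record at the end of the stream and return its offset."""
        self.stream.seek(0, io.SEEK_END)
        offset = self.stream.tell()
        header = bytes([LIVE]) + len(payload).to_bytes(4, "big")
        self.stream.write(header + payload)
        return offset

    def _header(self, offset: int):
        """Return (marker, payload length) of the record at offset."""
        self.stream.seek(offset)
        raw = self.stream.read(HEADER_SIZE)
        return raw[0], int.from_bytes(raw[1:], "big")

    def read(self, offset: int):
        """Return the payload, or None if the record carries the Delete marker."""
        marker, length = self._header(offset)
        if marker == DELETED:
            return None
        self.stream.seek(offset + HEADER_SIZE)
        return self.stream.read(length)

    def delete(self, offset: int) -> None:
        """A delete is an in-place update of the marker byte only."""
        self.stream.seek(offset)
        self.stream.write(bytes([DELETED]))

    def update(self, offset: int, payload: bytes) -> int:
        """Overwrite in place when lengths match; otherwise mark deleted and append."""
        marker, length = self._header(offset)
        if marker == DELETED:
            raise KeyError(f"record at offset {offset} is deleted")
        if len(payload) == length:
            self.stream.seek(offset + HEADER_SIZE)
            self.stream.write(payload)
            return offset
        self.delete(offset)
        return self.append(payload)
```

For example, with `store = RecordFile(io.BytesIO())` and `a = store.append(b"Smallville")`, calling `store.update(a, b"Metropolis")` returns the same offset because both payloads are 10 bytes, while `store.update(a, b"Bigtown")` marks the old record as deleted and returns the offset of a newly appended record. The choice between the two paths stays with the caller, in line with the "client's discretion" point above.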
This will enable fast processing of Big Data and enhance all of its characteristics: volume, velocity and variety.

There are many influences for this umbrella JIRA:
• Preserve and Accelerate Hadoop
• Efficient Data Management of a variety of Data Formats natively in Hadoop
• Enterprise Expansion
• Internet and Media
• Databases offer native support for a variety of Data Formats such as JSON, XML Indexes, and Temporal etc. – Hadoop should do the same.

It is quite probable that many sub-JIRAs will be created to address portions of this. This JIRA captures a variety of use-cases in one place. Some initial Data Management/Platform use-cases are given hereunder:

h2. Key-Value Store

With the proposed enhancements, it will become very convenient to implement a Key-Value Store natively in Hadoop.

h2. MVCC

An example of how MVCC can be implemented with the proposed enhancements, modified from PostgreSQL MVCC, is given hereunder.
https://wiki.postgresql.org/wiki/MVCC
http://momjian.us/main/writings/pgsql/mvcc.pdf

|| Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
| 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null. In order to maintain update size, MAX_VAL is pre-seeded for our purposes. |
| 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
| 2 | Update (old Delete) | 64 | 78 | Mark old data as DELETE. |
| 2 | Update (new insert) | 78 | MAX_VAL | Insert new data. |

h2. Graph Stores

Enable native storage and processing for a variety of graph stores.

h3. Graph Store 1 (Spark GraphX)

1. EdgeTable(pid, src, dst, data): stores the adjacency structure and edge data. Each edge is represented as a tuple consisting of the source vertex id, destination vertex id, and user-defined data, as well as a virtual partition identifier (pid). Note that the edge table contains only the vertex ids and not the vertex data. The edge table is partitioned by the pid.
2. VertexDataTable(id, data): stores the vertex data, in the form of vertex (id, data) pairs. The vertex data table is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id of a vertex to the ids of the virtual partitions that contain adjacent edges.

h3. Graph Store 2 (Facebook Social Graph - TAO)

Object: (id) → (otype, (key → value)∗)
Assoc.: (id1, atype, id2) → (time, (key → value)∗)

h2. WEB

With the AHA enhancements, a variety of Web standards can be natively supported, such as updateable JSON (http://json.org/), XML, RDF and other documents.

h3. RDF

RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/

The simplest triple statement is a sequence of (subject, predicate, object) terms, separated by whitespace and terminated by '.' after each triple.

h2. Mobile Apps Data and Resources

With the proposed enhancements, in addition to the Web, Apps Data and Resources can also be managed using Hadoop. Examples of such usage include App Data and Resources for Apple and other App stores.

About Apps Resources: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
On-Demand Resources Essentials: https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
Resource Programming Guide: https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf

h2. Temporal Data

https://en.wikipedia.org/wiki/Temporal_database
https://en.wikipedia.org/wiki/Valid_time

In temporal data, existing records may get updated to reflect changes in the facts they describe.
For example, the data may change from

Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)

to

Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)

h2. Media

Media production typically involves a lot of changes and updates prior to release. The enhancements will lay a basis for the full lifecycle to be managed in the Hadoop ecosystem.

h2. Indexes

With the changes, a variety of updatable indexes can be supported natively in Hadoop. Search software such as Solr, ElasticSearch etc. can then in turn leverage Hadoop’s enhanced native capabilities.

h2. Natural Support for ETL and Analytics

With native support for updates and deletes in addition to appends/inserts, Hadoop will have proper and natural support for ETL and Analytics.

h2. Google References

While Google’s research in this area is interesting (and some extracts are listed hereunder), the evolution of Hadoop is quite interesting. The proposed enhancements to support in-place update in core Hadoop will enable, and make easier, a variety of enhancements in each of the Hadoop components.

We propose a basis for a system that incrementally processes updates to large data sets and reduces the overhead of always having to do large batches. Hadoop engines can dynamically choose the processing style to use based on the type and volume of the data sets, and enhance or replace prevailing approaches.
|| Year || Title || Links ||
| 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
| 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
| 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
| 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
| 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
| 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
| 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
| 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
| 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
| 2011 | Tenzing A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
| 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
| 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
| 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |

h2. Application Domains

The enhancements will lay a path for comprehensive support of all application domains in Hadoop. A small collection is given hereunder.
Data Warehousing and Enhanced ETL processing
Supply Chain Planning
Web Sites
Mobile App Stores
Financials
Media
Machine Learning
Social Media
Enterprise Applications such as ERP, CRM

Corresponding umbrella JIRAs can be found for each of the following Hadoop platform components.

> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
>                 Key: HADOOP-12620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12620
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Dinesh S. Atreya

--
This message was sent by Atlassian JIRA (v6.3.4#6332)