[ 
https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh S. Atreya updated HADOOP-12620:
--------------------------------------
    Description: 
h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)

One main motivation for this JIRA is to address a comprehensive set of use 
cases with only minimal enhancements to Hadoop, moving Hadoop from a Modern 
Data Architecture to an Advanced/Cloud Data Architecture. 

HDFS traditionally had a write-once-read-many access model for files until 
the “Append to files in HDFS” capability was introduced. The next minimal 
enhancement to core Hadoop is the capability to do “updates-in-place” in 
HDFS: 
•       Support seeks for writes (in addition to reads).
•       After a seek, if the new byte length is the same as the old byte 
length, an in-place update is allowed.
•       A delete is an update that writes an appropriate delete marker.
•       If the byte length is different, the old entry is marked as deleted and 
the new one is appended as before. 
•       It is at the client’s discretion to perform an update, an append, or 
both; the API changes in the different Hadoop components should provide these 
capabilities.
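
The rules above can be sketched in a few lines. This is only an illustration 
under stated assumptions, not a proposed HDFS API: it uses an in-memory 
io.BytesIO as a stand-in for a seekable, writable file, and the one-byte 
record markers and the function names (append_record, update_record, 
delete_record) are hypothetical.

```python
import io

# Hypothetical one-byte record markers; these are not HDFS constants.
LIVE_MARKER = b"\x01"
DELETE_MARKER = b"\x00"


def append_record(f, payload):
    """Append [marker][payload] at the end of the file and return its offset."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(LIVE_MARKER + payload)
    return offset


def delete_record(f, offset):
    """A delete is an in-place update that overwrites the record marker."""
    f.seek(offset)
    f.write(DELETE_MARKER)


def update_record(f, offset, old_len, payload):
    """Update in place if the length is unchanged; otherwise delete and append."""
    if len(payload) == old_len:
        f.seek(offset + len(LIVE_MARKER))  # seek for write, skip the marker
        f.write(payload)
        return offset
    delete_record(f, offset)
    return append_record(f, payload)


f = io.BytesIO()
first = append_record(f, b"alpha")
second = append_record(f, b"beta")
first = update_record(f, first, 5, b"ALPHA")      # same length: in place
second = update_record(f, second, 4, b"gamma!")   # new length: delete + append
```

The caller keeps track of record offsets and lengths here; a real design 
would need an index or a self-describing record format for that.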

These minimal changes lay the basis for transforming core Hadoop into an 
interactive, real-time platform and for introducing significant native 
capabilities to Hadoop. They provide a foundation for supporting all of the 
following processing styles natively and dynamically: 
•       Real time 
•       Mini-batch  
•       Stream-based data processing
•       Batch, which is the default now.
Hadoop engines can dynamically choose which processing style to use based on 
the type and volume of the data sets, enhancing or replacing prevailing 
approaches.

With this, Hadoop engines can evolve to use modern CPU, memory, and I/O 
resources with increasing efficiency. The Hadoop task engines can use 
vectorized/pipelined processing and make greater use of memory throughout the 
Hadoop platform. 

These changes will allow enhanced performance optimizations to be implemented 
in HDFS and made available to all Hadoop components. This will enable fast 
processing of big data and improve handling of its three characteristics: 
volume, velocity, and variety.

There are many influences for this umbrella JIRA:

•       Preserve and Accelerate Hadoop
•       Efficient Data Management of a variety of Data Formats natively in 
Hadoop
•       Enterprise Expansion 
•       Internet and Media 
•       Databases offer native support for a variety of Data Formats such as 
JSON, XML Indexes, and Temporal data; Hadoop should do the same.

Many sub-JIRAs will probably be created to address portions of this work. 
This JIRA captures a variety of use cases in one place. Some initial data 
management and platform use cases are given below:

h2. Key-Value Store
With the proposed enhancements, it will become very convenient to implement a 
Key-Value Store natively in Hadoop.

h2. MVCC 

An example of how MVCC can be implemented with the proposed enhancements, 
adapted from PostgreSQL MVCC, is given below. 
https://wiki.postgresql.org/wiki/MVCC 
http://momjian.us/main/writings/pgsql/mvcc.pdf 



|| Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
| 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null. In order to maintain update size, MAX_VAL is pre-seeded for our purposes. |
| 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
| 2 | Update (old delete) | 64 | 78 | Mark the old data as deleted. |
| 2 | Update (new insert) | 78 | MAX_VAL | Insert the new data. |
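
As a rough sketch of the table above, the following shows why pre-seeding 
MAX_VAL matters: with a fixed-width encoding, setting the expiry counter 
never changes the record length, so it qualifies as an in-place update. The 
record layout (three 64-bit integers), the MAX_VAL value, and the names 
RowVersion, encode, expire, and is_visible are assumptions made for 
illustration; the visibility rule is a simplification of PostgreSQL's.

```python
import struct
from dataclasses import dataclass

MAX_VAL = 2**63 - 1                 # assumed sentinel for "not yet expired"
RECORD = struct.Struct(">qqq")      # assumed layout: id, create, expiry


@dataclass
class RowVersion:
    data_id: int
    create_counter: int
    expiry_counter: int = MAX_VAL   # pre-seeded so the encoded size is fixed


def encode(version):
    """Serialize a row version to a fixed-width byte string."""
    return RECORD.pack(version.data_id, version.create_counter,
                       version.expiry_counter)


def expire(version, counter):
    """Mark a version as deleted by setting its expiry counter."""
    version.expiry_counter = counter


def is_visible(version, snapshot):
    """A version is visible from its creation until its expiry counter."""
    return version.create_counter <= snapshot < version.expiry_counter


row1 = RowVersion(1, 40)
size_before = len(encode(row1))
expire(row1, 47)                    # the "Delete" row of the table
size_after = len(encode(row1))

old2 = RowVersion(2, 64)
expire(old2, 78)                    # "Update (old delete)"
new2 = RowVersion(2, 78)            # "Update (new insert)"
```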




h2. Graph Stores
Enable native storage and processing for a variety of graph stores. 

h3. Graph Store 1 (Spark GraphX)
1. EdgeTable(pid, src, dst, data): stores the adjacency 
structure and edge data. Each edge is represented as a
tuple consisting of the source vertex id, destination vertex id,
and user-defined data as well as a virtual partition identifier
(pid). Note that the edge table contains only the vertex ids
and not the vertex data. The edge table is partitioned by the
pid.
2. VertexDataTable(id, data): stores the vertex data,
in the form of vertex (id, data) pairs. The vertex data table
is indexed and partitioned by the vertex id.
3. VertexMap(id, pid): provides a mapping from the id
of a vertex to the ids of the virtual partitions that contain
adjacent edges.  
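
The three tables can be sketched with plain dictionaries. This is an 
illustration only, not GraphX code: the function name build_graph_tables and 
the partitioner (source vertex id modulo the number of partitions) are 
assumptions.

```python
from collections import defaultdict


def build_graph_tables(edges, vertex_data, num_partitions):
    """Build the edge table, vertex data table, and vertex map.

    edges: iterable of (src, dst, data) with integer vertex ids.
    vertex_data: iterable of (id, data) pairs.
    """
    edge_table = defaultdict(list)   # pid -> [(src, dst, data)]
    vertex_map = defaultdict(set)    # vertex id -> {pid, ...}
    for src, dst, data in edges:
        pid = src % num_partitions   # assumed partitioner
        edge_table[pid].append((src, dst, data))
        # Both endpoints have an adjacent edge in this partition.
        vertex_map[src].add(pid)
        vertex_map[dst].add(pid)
    vertex_data_table = dict(vertex_data)   # indexed by vertex id
    return dict(edge_table), vertex_data_table, dict(vertex_map)


edge_table, vertex_data_table, vertex_map = build_graph_tables(
    edges=[(1, 2, "a"), (2, 3, "b"), (3, 1, "c")],
    vertex_data=[(1, "v1"), (2, "v2"), (3, "v3")],
    num_partitions=2,
)
```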

h3. Graph Store 2 (Facebook Social Graph - TAO)

Object:  (id) → (otype,(key → value)∗ )
Assoc.: (id1,atype,id2) → (time,(key → value) ∗ )
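
The two mappings above translate directly into keyed dictionaries. The 
sketch below is a toy in-memory model, not TAO's API; the class name 
TaoStore and its method names are hypothetical. Re-adding an existing 
association overwrites it, which is the kind of update the proposed 
enhancements are meant to support.

```python
from collections import defaultdict


class TaoStore:
    """Toy model of the object and association mappings."""

    def __init__(self):
        self.objects = {}                  # id -> (otype, {key: value})
        self.assocs = defaultdict(dict)    # (id1, atype) -> {id2: (time, {key: value})}

    def add_object(self, oid, otype, **fields):
        self.objects[oid] = (otype, dict(fields))

    def add_assoc(self, id1, atype, id2, time, **fields):
        # Adding the same (id1, atype, id2) again replaces the old entry.
        self.assocs[(id1, atype)][id2] = (time, dict(fields))

    def assoc_range(self, id1, atype):
        """Return (id2, time, fields) tuples, newest first."""
        items = self.assocs.get((id1, atype), {}).items()
        return sorted(((id2, t, kv) for id2, (t, kv) in items),
                      key=lambda entry: entry[1], reverse=True)


store = TaoStore()
store.add_object(1, "user", name="Alice")
store.add_object(2, "user", name="Bob")
store.add_object(3, "user", name="Carol")
store.add_assoc(1, "friend", 2, time=100)
store.add_assoc(1, "friend", 3, time=200)
before = store.assoc_range(1, "friend")
store.add_assoc(1, "friend", 2, time=300)   # update the existing association
after = store.assoc_range(1, "friend")
```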

h2. Web
With the AHA enhancements, a variety of Web standards, such as updatable JSON 
(http://json.org/), XML, RDF, and other documents, can be natively supported.


h2. RDF
RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/ 
RDF 1.1 N-Triples: http://www.w3.org/TR/2014/REC-n-triples-20140225/ 
The simplest triple statement is a sequence of (subject, predicate, object) 
terms, separated by whitespace and terminated by '.' after each triple.
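
A minimal sketch of that line format follows. It handles only the simplest 
case, where all three terms are IRIs written in angle brackets; literals, 
blank nodes, and escapes are deliberately left out. The function names and 
the example.org IRIs are made up for illustration.

```python
def format_triple(subject, predicate, obj):
    """Serialize three IRIs as one N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def parse_triple(line):
    """Parse a line containing three IRI terms and a terminating '.'."""
    line = line.strip()
    if not line.endswith("."):
        raise ValueError("a triple must be terminated by '.'")
    terms = line[:-1].split()
    if len(terms) != 3:
        raise ValueError("expected subject, predicate, and object")
    if not all(t.startswith("<") and t.endswith(">") for t in terms):
        raise ValueError("this sketch supports IRI terms only")
    return tuple(t[1:-1] for t in terms)


line = format_triple("http://example.org/alice",
                     "http://example.org/knows",
                     "http://example.org/bob")
triple = parse_triple(line)
```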

h2. Mobile Apps Data and Resources

With the proposed enhancements, App Data and Resources can also be managed 
using Hadoop, in addition to the Web. Examples of such usage include App Data 
and Resources for Apple and other App stores.

About Apps Resources: 
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
 
On-Demand Resources Essentials: 
https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
 
Resource Programming Guide: 
https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
 



h2. Temporal Data 
https://en.wikipedia.org/wiki/Temporal_database 
https://en.wikipedia.org/wiki/Valid_time 
Temporal data may be updated to reflect newly learned changes.
For example, the data may change from 
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
to
Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
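
The change above amounts to splitting an existing valid-time interval around 
a newly learned one. A minimal sketch, assuming half-open intervals 
[valid_from, valid_to) and records stored as plain tuples; the function name 
apply_correction is hypothetical.

```python
from datetime import date


def apply_correction(history, name, place, start, end):
    """Insert a valid-time record, trimming or splitting overlapping ones.

    history: list of (name, place, valid_from, valid_to) tuples.
    """
    result = []
    for n, p, valid_from, valid_to in history:
        if n != name or valid_to <= start or valid_from >= end:
            result.append((n, p, valid_from, valid_to))   # no overlap
            continue
        if valid_from < start:
            result.append((n, p, valid_from, start))      # part before
        if valid_to > end:
            result.append((n, p, end, valid_to))          # part after
    result.append((name, place, start, end))
    return sorted(result, key=lambda r: (r[0], r[2]))


history = [
    ("John Doe", "Smallville", date(1975, 4, 3), date(1994, 8, 26)),
    ("John Doe", "Bigtown", date(1994, 8, 26), date(2001, 4, 1)),
]
updated = apply_correction(history, "John Doe", "Beachy",
                           date(1995, 6, 1), date(2000, 9, 3))
```

With in-place updates, only the end date of the existing Bigtown record 
changes, and the two new records can be appended.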

h2. Media
Media production typically involves many changes and updates prior to 
release. The enhancements will lay a basis for managing the full lifecycle in 
the Hadoop ecosystem. 

h2. Indexes
With the changes, a variety of updatable indexes can be supported natively in 
Hadoop. Search software such as Solr and ElasticSearch can then in turn 
leverage Hadoop’s enhanced native capabilities. 

h2. Natural Support for ETL and Analytics
With native support for updates and deletes in addition to appends/inserts, 
Hadoop will have proper and natural support for ETL and Analytics.

h2. Google References

Google’s research in this area is interesting (some references are listed 
below), and so is the evolution of Hadoop. The proposed enhancements to 
support in-place updates in core Hadoop will enable, and make easier, a 
variety of enhancements in each of the Hadoop components.

We propose a basis for a system that incrementally processes updates to large 
data sets and reduces the overhead of always having to run large batches. 
Hadoop engines can dynamically choose which processing style to use based on 
the type and volume of the data sets, enhancing or replacing prevailing 
approaches.


|| Year || Title || Links ||
| 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
| 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
| 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
| 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
| 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
| 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
| 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
| 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
| 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
| 2011 | Tenzing A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
| 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
| 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
| 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |

h2. Application Domains

The enhancements will lay a path toward comprehensive support of all 
application domains in Hadoop. A small selection is given below.

•       Data Warehousing and Enhanced ETL processing
•       Supply Chain Planning
•       Web Sites
•       Mobile App Stores
•       Financials
•       Media
•       Machine Learning
•       Social Media
•       Enterprise Applications such as ERP and CRM


Corresponding umbrella JIRAs can be found for each of the following Hadoop 
platform components. 



> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
>                 Key: HADOOP-12620
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12620
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Dinesh S. Atreya
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
