Jonathan, I understand your confusion. Hadoop and Big Data reached "overused but not well understood" status years ago.

First, Hadoop started out as a MapReduce engine. This all changed with Hadoop V2 and YARN (Yet Another Resource Negotiator). Hadoop V2 can be considered a platform on which applications that need parallel access to large amounts of unstructured data (i.e. raw data not in a traditional database) can run. It can also be used with its own database, HBase, which is based on Google Bigtable.

The idea is this: a "Hadoop" cluster has a large amount of storage using HDFS (or possibly another parallel filesystem). This is often referred to as the "Data Lake." Raw data is dumped in the lake. There is no ETL (Extract, Transform, and Load) step. Various Hadoop YARN frameworks use this data. YARN provides a very dynamic resource allocation model and the ability to provide data locality to your application (i.e. the traditional MapReduce idea was "move the computation to the data").

Thus in a Hadoop V2 cluster you can have MapReduce applications (which support many of the popular apps like Pig and Hive). It also supports Spark, Storm, Giraph, and even MPI (not the most efficient, but it works). There are many other applications being ported to YARN.

Second, Big Data is usually defined by Volume, Velocity, and Variety. In practice, however, the definition seems to be whatever a vendor wants it to be. It reminds me of products that suddenly became "grid ready" in years past. Again, such designations mean about as much as "now works with binary data."

Finally, if you are interested in Hadoop YARN you can check out the book "Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2" (I helped write it). There are also many online resources. The first chapter of the book has the history of Hadoop as written by one of the developers. It is quite interesting to read and helps dispel many of the Hadoop myths. You can read this chapter for free here:

http://ptgmedia.pearsoncmg.com/images/9780321934505/samplepages/0321934504.pdf

That is enough Hadoop for Saturday morning.
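
To make the MapReduce model mentioned above a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up purely for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one raw input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (a framework does this for you)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of raw lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

if __name__ == "__main__":
    # Hypothetical raw, unstructured input; no schema or ETL step is applied first.
    raw_lines = ["big data big cluster", "data lake"]
    print(word_count(raw_lines))
```

On a real cluster the map calls would run in parallel on the nodes that hold the data blocks, which is the "move the computation to the data" part; this sketch only shows the shape of the three phases.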

Oh, and Hadoop clusters are not going to supplant your HPC cluster.

--
Doug

> Can someone explain to me what exactly the purpose of hadoop is and what
> we mean when we say big data? Is this for data storage and retrieval?
> Number crunching?
>
> --
> Regards,
> Jonathan Aquilina
> Founder Eagle Eye T
>
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
