Time for my two cents...

For the best understanding of Hadoop, I think Google's original papers on MapReduce and GFS (Google File System) are still the best starting point. If for no other reason, they were written before the Hadoop hype-train left the station, so they don't claim that MapReduce can fix every problem.

http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/gfs.html
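For anyone who hasn't read the MapReduce paper yet, the core programming model is small enough to sketch in a few lines. Here's a single-process toy of the canonical word-count example from the paper; the function names (map_fn, reduce_fn, mapreduce) are my own for illustration, not Hadoop's actual API:

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values emitted for one key.
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["the cat", "the dog"]))
# {'the': 2, 'cat': 1, 'dog': 1}
```

Of course, the whole point of the real thing is that the map, shuffle, and reduce phases are distributed across thousands of machines with fault tolerance; the model itself is this simple.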


On 02/07/2015 01:38 PM, Douglas Eadline wrote:
Hello Jonathan.
Here is a good document to get you thinking:
http://www.cs.berkeley.edu/~rxin/db-papers/WarehouseScaleComputing.pdf

Although Doug said "Oh, and Hadoop clusters are not going to supplant your HPC cluster," I should have continued, "... and there will be overlap."

Definitely. Glenn Lockwood has written a great article on exactly this, which I think sums up the issues perfectly; it has been discussed on this list in the past:

http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html

In my opinion, a common source of confusion is that people use the term 'big data' to refer to ALL data that takes up GBs, TBs, or PBs. My definition requires 'unstructured data from disparate sources': data from different origins, in different formats. A relational database, like Oracle or MySQL, doesn't fit this definition regardless of its size, since it's structured. I also don't consider the output of HPC simulations to be 'big data', since those output files are usually structured in some way (HDF5, or whatever). Are those examples a LOT of data? Yes, but I don't consider them 'big data'.

I hope I didn't just start a religious war.

--
Prentice


_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
