Hi Donal-

On Fri, Nov 11, 2011 at 10:12:44PM +0800, ?????? wrote:
> My scenario is that I have lots of files from a High Energy Physics
> experiment. These files are in binary format, about 2 GB each, but basically
> they are composed of lots of "Events"; each Event is independent of the
> others. The physicists use a C++ program called ROOT to analyze these files
> and write the output to a result file (using open(), read(), write()). I'm
> considering how to store the files in HDFS and use MapReduce to analyze them.
May I ask which experiment you're working on? We run an HDFS cluster at one of
the analysis centers for the CMS detector at the LHC. I'm not aware of anyone
using Hadoop's MR for analysis, though about 10 PB of LHC data is now stored in
HDFS.

For your/our use case, I think you would have to implement a domain-specific
InputFormat yielding Events; the ROOT files themselves would be stored as-is in
HDFS (a rough sketch appears below).

In CMS, we mostly run traditional HEP simulation and analysis workflows as
plain batch jobs managed by common schedulers like Condor or PBS. These of
course lack some of the features of the MR schedulers (like location
awareness), but they have advantages of their own. For example, we run Condor
schedulers that transparently manage workflows of tens of thousands of jobs on
dozens of heterogeneous clusters across North America.

Feel free to contact me off-list if you have more HEP-specific questions about
HDFS. Thanks!

-- 
Will Maier - UW High Energy Physics
cel: 608.438.6162
tel: 608.263.9692
web: http://www.hep.wisc.edu/~wcmaier/
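To make the InputFormat idea concrete, here is a minimal, untested sketch
against the Hadoop 0.20+ (org.apache.hadoop.mapreduce) API. The EventDecoder
class is a hypothetical placeholder: real code would need to understand ROOT's
on-disk layout (or call into ROOT itself, e.g. via JNI). Files are treated as
unsplittable for simplicity, since ROOT files cannot be cut at arbitrary byte
offsets; everything else is standard Hadoop plumbing.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Sketch of a domain-specific InputFormat that presents each HEP "Event"
 * in a ROOT file as one MapReduce record (key = event index within the
 * file, value = the raw event bytes).
 */
public class EventInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // ROOT files cannot be split at arbitrary byte offsets, so treat
        // each file as a single split for this sketch.
        return false;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new EventRecordReader();
    }

    /** Placeholder for a ROOT-aware parser; a real version would know the
     *  ROOT file format or wrap a call into the ROOT libraries. */
    public static class EventDecoder {
        public EventDecoder(FSDataInputStream in) { /* skip headers, etc. */ }
        /** Returns the next event's bytes, or null at end of file. */
        public byte[] readNextEvent() throws IOException {
            throw new UnsupportedOperationException("ROOT parsing not shown");
        }
    }

    public static class EventRecordReader
            extends RecordReader<LongWritable, BytesWritable> {

        private FSDataInputStream in;
        private EventDecoder decoder;
        private long fileLength;
        private long eventIndex = -1;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(conf);
            fileLength = fs.getFileStatus(path).getLen();
            in = fs.open(path);
            decoder = new EventDecoder(in);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            byte[] event = decoder.readNextEvent();
            if (event == null) {
                return false;                // end of file
            }
            eventIndex++;
            key.set(eventIndex);
            value.set(event, 0, event.length);
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            // Approximate progress as the fraction of the file consumed.
            return fileLength == 0 ? 1.0f
                    : Math.min(1.0f, in.getPos() / (float) fileLength);
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}

A smarter version would let the decoder locate event boundaries near HDFS
block edges, so that large files could be split and processed with better
data locality.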