Thanks Bejoy, that helps a lot!
2011/11/11, Bejoy KS <bejoy.had...@gmail.com>:
> Hi Donal
> I don't have much exposure to the domain you are pointing to, but from a
> plain MapReduce developer's perspective, this would be my way of
> processing such a data format with MapReduce:
> - If the data is flowing in continuously, I'd use Flume to collect the
>   binary data, write it into sequence files, and load them into HDFS.
> - If it is already existing large data, I'd use a SequenceFile writer to
>   write the binary data as sequence files into HDFS, where HDFS would
>   take care of the splits.
> - I'd use SequenceFileInputFormat for my MapReduce job.
> - If my application code is in a compatible language other than Java,
>   I'd use the Streaming API to trigger my MapReduce job.
>
> If there are any specific constraints on reading your data, as Will
> mentioned, you may need to go with your own custom InputFormats for
> processing it.
>
>
> Hope it helps!...
>
>
> On Fri, Nov 11, 2011 at 8:12 PM, Charles Earl <charlesce...@me.com> wrote:
>
>> Hi,
>> Please also feel free to contact me. I'm working with the STAR project at
>> Brookhaven Lab, and we are trying to build an MR workflow for analysis of
>> particle data. I've done some preliminary experiments running ROOT and
>> other nuclear physics analysis software in MR and have been looking at
>> various file layouts.
>> Charles
>>
>> On Nov 11, 2011, at 9:26 AM, Will Maier wrote:
>>
>> > Hi Donal-
>> >
>> > On Fri, Nov 11, 2011 at 10:12:44PM +0800, Donal wrote:
>> >> My scenario is that I have lots of files from a High Energy Physics
>> >> experiment. These files are in binary format, about 2 GB each, but
>> >> basically they are composed of lots of "Events", and each Event is
>> >> independent of the others. The physicists use a C++ program called
>> >> ROOT to analyze these files and write the output to a result file
>> >> (using open(), read(), write()). I'm considering how to store the
>> >> files in HDFS and use MapReduce to analyze them.
>> >
>> > May I ask which experiment you're working on? We run an HDFS cluster at
>> > one of the analysis centers for the CMS detector at the LHC. I'm not
>> > aware of anyone using Hadoop's MR for analysis, though about 10 PB of
>> > LHC data is now stored in HDFS. For your/our use case, I think that you
>> > would have to implement a domain-specific InputFormat yielding Events.
>> > ROOT files would be stored as-is in HDFS.
>> >
>> > In CMS, we mostly run traditional HEP simulation and analysis workflows
>> > using plain batch jobs managed by common schedulers like Condor or PBS.
>> > These of course lack some of the features of the MR schedulers (like
>> > location awareness), but have some advantages. For example, we run
>> > Condor schedulers that transparently manage workflows of tens of
>> > thousands of jobs on dozens of heterogeneous clusters across North
>> > America.
>> >
>> > Feel free to contact me off-list if you have more HEP-specific
>> > questions about HDFS.
>> >
>> > Thanks!
>> >
>> > --
>> >
>> > Will Maier - UW High Energy Physics
>> > cel: 608.438.6162
>> > tel: 608.263.9692
>> > web: http://www.hep.wisc.edu/~wcmaier/
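For anyone following the thread, a rough sketch of Bejoy's second suggestion
(packing already-existing binary event data into sequence files in HDFS) could
look like the following. This is only a sketch: the event-reading side is a
placeholder, and how the experiment's binary format is actually decoded (ROOT
I/O, a converter tool, a JNI bridge, etc.) is up to you. The class name and
readNextEvent() below are purely illustrative, not part of Hadoop or ROOT.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class EventSeqFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path(args[0]);   // e.g. an HDFS path for the .seq file

            // key = event sequence number, value = raw event bytes
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, BytesWritable.class);
            try {
                long id = 0;
                byte[] event;
                while ((event = readNextEvent()) != null) {   // hypothetical reader
                    writer.append(new LongWritable(id++), new BytesWritable(event));
                }
            } finally {
                writer.close();
            }
        }

        // Placeholder: in practice this would call whatever code can decode
        // one Event from the experiment's binary files into a byte[].
        private static byte[] readNextEvent() {
            return null;
        }
    }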
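And a minimal sketch of a MapReduce job that consumes those sequence files via
SequenceFileInputFormat, as Bejoy suggests. The mapper body is only a stub; the
real per-Event analysis would live in your own code, or be handed off to
external tools via Streaming or a JNI bridge. Nothing here is specific to ROOT.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EventAnalysisJob {

        // Reads the (event id, raw event bytes) pairs written above.
        public static class EventMapper
                extends Mapper<LongWritable, BytesWritable, LongWritable, LongWritable> {
            @Override
            protected void map(LongWritable key, BytesWritable value, Context context)
                    throws IOException, InterruptedException {
                byte[] buf = value.getBytes();   // note: buffer may be padded
                int len = value.getLength();     // actual event length
                // ... analyze the Event here (or pass it to external code) ...
                context.write(key, new LongWritable(len));   // stub output
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "event-analysis");
            job.setJarByClass(EventAnalysisJob.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setMapperClass(EventMapper.class);
            job.setNumReduceTasks(0);            // map-only, one result per Event
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If the ROOT files are kept as-is in HDFS instead, as Will describes, this is
where a custom, domain-specific InputFormat/RecordReader yielding Events would
replace SequenceFileInputFormat.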