questions regarding data storage and inputformat
Hi Folks,

I have a bunch of binary files which I've stored in a SequenceFile. The name of each file is the key, the data is the value, and I've stored them sorted by key. (I'm not tied to using a SequenceFile for this.) The current test data is only 50MB, but the real data will be 500MB - 1GB.

My M/R job requires that its input be several of these records from the sequence file, and which records belong together is determined by the key. The sorting mentioned above keeps these all packed together.

1. Any reason not to use a sequence file for this? Perhaps a MapFile? Since I've sorted it, I don't need "random" accesses, but I do need to be aware of the keys, as I need to be sure that all of the relevant keys get sent to a given mapper.

2. Looks like I want a custom InputFormat for this, extending SequenceFileInputFormat. Do you agree? I'll gladly take some opinions on this, as I ultimately want to split the file based on what's in it, which might be a little unorthodox.

3. Another idea might be to create separate seq files for each chunk of records and make them non-splittable, ensuring that each goes to a single mapper. Assuming I can get away with this, see any pros/cons with that approach?

Thanks,

Tom

--
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs
Re: questions regarding data storage and inputformat
> 1. Any reason not to use a sequence file for this? Perhaps a mapfile?
> Since I've sorted it, I don't need "random" accesses, but I do need
> to be aware of the keys, as I need to be sure that I get all of the
> relevant keys sent to a given mapper

MapFile *may* be better here (see my answer for 2 below).

> 2. Looks like I want a custom inputformat for this, extending
> SequenceFileInputFormat. Do you agree? I'll gladly take some
> opinions on this, as I ultimately want to split the file based on what's in
> the file, which might be a little unorthodox.

If you need to split based on where certain keys are in the file, then a SequenceFile isn't a great solution. It would require that your InputFormat scan through all of the data just to find split points. Assuming you know which keys to split on ahead of time, you could use MapFiles and find the exact split points more quickly.

> 3. Another idea might be to create separate seq files for each chunk of
> records and make them non-splittable, ensuring that they go to a
> single mapper. Assuming I can get away with this, see any pros/cons
> with that approach?

Separate sequence files would require the least amount of custom code.

-Joey

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
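
[Editor's sketch] The answer to question 2 rests on the difference between scanning a whole file and looking a key up in a sorted index. The plain-Java sketch below only models that idea with an in-memory TreeMap; it is not the Hadoop MapFile API, and the class name SplitIndexSketch, the method seekOffset, and the offsets used are assumptions made up for illustration. Given a split key known ahead of time, it returns the offset of the nearest indexed key at or before it, which is where a reader would start instead of reading from the beginning.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: models a sorted key index in memory.
// This is NOT the Hadoop MapFile API; all names here are hypothetical.
public class SplitIndexSketch {
    // Hypothetical index: record key -> byte offset of that record in the data file.
    private final TreeMap<String, Long> index = new TreeMap<>();

    public void add(String key, long offset) {
        index.put(key, offset);
    }

    // Returns the offset of the last indexed key <= splitKey, i.e. the point
    // from which a reader would scan forward. Returns 0 when splitKey sorts
    // before every indexed key.
    public long seekOffset(String splitKey) {
        Map.Entry<String, Long> entry = index.floorEntry(splitKey);
        return entry == null ? 0L : entry.getValue();
    }
}
```

The lookup is logarithmic in the number of indexed keys, whereas finding the same boundary in an unindexed sorted file means reading records until the key shows up.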
Re: questions regarding data storage and inputformat
>> 3. Another idea might be to create separate seq files for each chunk of
>> records and make them non-splittable, ensuring that they go to a
>> single mapper. Assuming I can get away with this, see any pros/cons
>> with that approach?
>
> Separate sequence files would require the least amount of custom code.

Thanks for the response, Joey.

So, if I were to do the above, I would still need a custom record reader to put all the keys and values together, right?

Thanks,

Tom

--
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs
Re: questions regarding data storage and inputformat
You could either use a custom RecordReader or you could override the run() method on your Mapper class to do the merging before calling the map() method.

-Joey

On Wed, Jul 27, 2011 at 11:09 AM, Tom Melendez wrote:
>>> 3. Another idea might be to create separate seq files for each chunk of
>>> records and make them non-splittable, ensuring that they go to a
>>> single mapper. Assuming I can get away with this, see any pros/cons
>>> with that approach?
>>
>> Separate sequence files would require the least amount of custom code.
>
> Thanks for the response, Joey.
>
> So, if I were to do the above, I would still need a custom record
> reader to put all the keys and values together, right?
>
> Thanks,
>
> Tom
>
> --
> ===
> Skybox is hiring.
> http://www.skyboximaging.com/careers/jobs

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
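
[Editor's sketch] The merging step suggested above could look roughly like the following. This is plain Java with no Hadoop dependency, so it only mirrors the control flow one might put in an overridden Mapper.run(): walk the records in key order, buffer consecutive records that belong to the same group, and call the per-group logic once per group. The class GroupingRunSketch, the Record type, the groupOf function, and the "group/name" key layout in the usage below are all assumptions for illustration; the thread does not say how the keys encode which records belong together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Illustration only: mimics the loop an overridden Mapper.run() might contain.
// All names here are hypothetical; nothing below is Hadoop API.
public class GroupingRunSketch {
    public static final class Record {
        public final String key;
        public final byte[] value;

        public Record(String key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    // Walks records that are already sorted by key and hands each run of
    // consecutive records sharing a group id to mapGroup exactly once.
    public static void run(Iterable<Record> sortedRecords,
                           Function<String, String> groupOf,
                           BiConsumer<String, List<Record>> mapGroup) {
        String currentGroup = null;
        List<Record> buffer = new ArrayList<>();
        for (Record record : sortedRecords) {
            String group = groupOf.apply(record.key);
            if (currentGroup != null && !group.equals(currentGroup)) {
                // Group boundary reached: emit the finished group, start a new buffer.
                mapGroup.accept(currentGroup, buffer);
                buffer = new ArrayList<>();
            }
            currentGroup = group;
            buffer.add(record);
        }
        if (currentGroup != null) {
            // Emit the final group once the input is exhausted.
            mapGroup.accept(currentGroup, buffer);
        }
    }
}
```

Calling run(records, k -> k.substring(0, k.indexOf('/')), handler) would treat everything before the first '/' as the group id. Note that each group is held in memory until it is complete, so this only suits groups whose combined binary values fit comfortably in the mapper's heap.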