Re: Reducer-side join example
Hi, Your question has an academic sound, so I'll give it an academic answer ;). Unfortunately, there are not really any good generalized (i.e., cross-joining a large matrix with a large matrix) methods for doing joins in map-reduce. The fundamental reason is that in the general case you're comparing everything to everything, so for each pair of possible rows you must actually generate that pair of rows. This means every node ships all of its data to every other node, no matter what (in the general case). I bring this up not because you're looking to optimize cross joining, but because it demonstrates the point that you will exploit the characteristics of your data no matter what strategy you choose, and each strategy will have domain-specific flaws and advantages.

The typical strategy for a reduce-side join is to use Hadoop's sorting functionality to group rows by their keys, so that the entire data set for a particular key will be resident on a single reducer. The key insight is to think about the join as a sorting problem. Yes, this means you risk producing data sets that fill your reducers, but that's a trade-off you accept to reduce the complexity of the original problem. If the existing join framework in Hadoop (whose javadocs are quite thorough) is inadequate, you shouldn't be afraid to invent, implement, and test join strategies that are specific to your domain.

On Tue, Apr 6, 2010 at 11:01 AM, M B machac...@gmail.com wrote: Thanks, I appreciate the example - what happens if File A and B have many more columns (all different data types)? The logic doesn't seem to work in that case - unless we set up the values in the Map function to include the file name (maybe the output value is a HashMap or something, which might work). Also, I was asking to see a reduce-side join, as we have other things going on in the Mapper and I'm not sure if we can tweak its output (we send output to multiple places). Does anyone have an example using the contrib/DataJoin or something similar? thanks

On Mon, Apr 5, 2010 at 7:03 PM, He Chen airb...@gmail.com wrote: For the Map function:
Input key: default
Input value: File A and File B lines
Output key: A, B, C, ... (first column of the final result)
Output value: 12, 24, Car, 13, Van, SUV, ...

Reduce function: take the Map output and do:

  for each key {
    if the value is an integer, save it to array1;
    else save it to array2
  }
  for ith element in array1
    for jth element in array2
      output(key, array1[i] + "\t" + array2[j]);
  done

Hope this helps.

On Mon, Apr 5, 2010 at 4:10 PM, M B machac...@gmail.com wrote: Hi, I need a good java example to get me started with some joining we need to do, any examples would be appreciated.

File A:
Field1 Field2
A 12
B 13
C 22
A 24

File B:
Field1 Field2 Field3
A Car ...
B Truck ...
B SUV ...
B Van ...

So, we need to first join File A and B on Field1 (say both are string fields). The result would just be:

A 12 Car ...
A 24 Car ...
B 13 Truck ...
B 13 SUV ...
B 13 Van ...

and so on - with all the fields from both files returning. Once we have that, we sometimes need to then transform it so we have a single record per key (Field1):

A (12,Car) (24,Car)
B (13,Truck) (13,SUV) (13,Van)

--however it looks, basically tuples for each key (we'll modify this later to return a concatenated set of fields from B, etc). At other times, instead of transforming to a single row, we just need to modify rows based on values. So if B.Field2 equals Van, we need to set Output.Field2 = whatever, then output to file ...
Are there any good examples of this in native Java (we can't use Pig/Hive/etc)? thanks.

--
Best Wishes!
Chen He
Ph.D. student, CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln, NE 68588
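[For what it's worth, here is a minimal sketch of the tag-the-source approach discussed in this thread, written against the 0.20 org.apache.hadoop.mapreduce API. The class names (JoinMapper, JoinReducer) and the file-name test are hypothetical, and it assumes tab-separated input with Field1 first; it's a sketch, not contrib/DataJoin itself.]

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags each line with its source file so the reducer can tell A-rows from B-rows.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
    String tag = file.startsWith("fileA") ? "A" : "B"; // hypothetical file naming
    String[] parts = line.toString().split("\t", 2);   // Field1 TAB rest-of-row
    if (parts.length == 2) {
      ctx.write(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
    }
  }
}

// Buffers the A-rows and B-rows for one key, then emits their cross product.
// Note both sides are held in memory: the "fill your reducers" trade-off above.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    List<String> aRows = new ArrayList<String>();
    List<String> bRows = new ArrayList<String>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("A\t")) aRows.add(s.substring(2));
      else bRows.add(s.substring(2));
    }
    for (String a : aRows) {
      for (String b : bRows) {
        ctx.write(key, new Text(a + "\t" + b));
      }
    }
  }
}

[Because the extra columns simply ride along inside the tagged value, this handles the many-columns case without a HashMap; the single-record-per-key variant would concatenate bRows into one value instead of looping.]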
Re: Help on processing large amount of videos on hadoop
Hi Huazhong, Sounds like an interesting application. Here are a few tips.

1. If the frames are not independent, you should find a way to key them according to their order before dumping them into Hadoop, so that they can be sorted as part of your map-reduce task. BTW, the video won't appear split while it's in HDFS; HDFS does use a block-splitting scheme for replication and (sort of) job distribution, but this isn't mandatory, and there are lots of facilities to customize this behavior.

2. Is the audio needed? If not, it may make sense to preprocess the data as (key, image) pairs, where the key is something like what I mentioned above, and the image is a custom Writable that handles the data in a common format like GIF, JPG, PNG, whatever (a sketch follows below).

3. You'll need some in-depth knowledge of your video codec for this. While I'm not an expert on video codecs, I think many of them compress by specifying key frames that have complete data, and then representing subsequent frames as differences from a key frame. You can use a custom input format and split to split on a frame, but you will need to be an expert on your codec to do so. It might be easier to use a framework to pre-transcode the data into whatever you will use for your map-reduce jobs.

On Thu, Dec 17, 2009 at 2:25 PM, Huazhong Ning n...@akiira.com wrote: Hi, I set up a Hadoop platform and I am going to use it to process a large amount of video (each file is about 500M-1G). But I have hit some hard issues:
1. The frames in each video are not independent, so we may have problems if we split the video into blocks and distribute them in HDFS.
2. The video is compressed, but we hope the input to the map class is video frames. In other words, we need to put the codec somewhere.
3. Our codec (third-party source code) takes a video file name as input. Can we get the file name?
Any suggestions and comments are welcome. Thanks a lot. Ning
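[A minimal sketch of the custom Writable mentioned in tip 2. FrameWritable is a hypothetical name, and the fields are just one plausible layout: the video name and frame index carry the ordering information from tip 1, and the image bytes are whatever pre-transcoded format you settle on.]

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// One decoded frame, carrying enough metadata to restore frame order
// after the shuffle, plus the encoded image bytes (e.g. JPEG/PNG).
public class FrameWritable implements Writable {
  private Text videoName = new Text();               // which source video
  private long frameIndex;                           // position within that video
  private BytesWritable image = new BytesWritable(); // encoded frame bytes

  public void set(String video, long index, byte[] imageBytes) {
    videoName.set(video);
    frameIndex = index;
    image.set(imageBytes, 0, imageBytes.length);
  }

  public void write(DataOutput out) throws IOException {
    videoName.write(out);
    out.writeLong(frameIndex);
    image.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    videoName.readFields(in);
    frameIndex = in.readLong();
    image.readFields(in);
  }
}

[Pairing this value with a map output key like "videoName:frameIndex" (or a proper WritableComparable) lets Hadoop's sort restore frame order on the reduce side.]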
Re: Can hadoop 0.20.1 programs runs on Amazon Elastic Mapreduce?
Last time I checked, EMR only runs 0.18.3. You can use EC2 though, which winds up being cheaper anyway. On Wed, Dec 16, 2009 at 8:51 PM, 松柳 lamfeeli...@gmail.com wrote: Hi all, I'm wondering whether Amazon has started to support the newest stable version of Hadoop, or can we still just use 0.18.3? Song Liu
Re: multiple file input
One important thing to note is that, with cross products, you'll almost always get better performance if you can fit both files on a single node's disk rather than distributing the files across the cluster.

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 la...@laserxyz.de wrote: pmg wrote: I am evaluating Hadoop for a problem that does a Cartesian product of input from one file of 600K records (FileA) with another set of files (FileB1, FileB2, FileB3) with 2 million lines in total. Each line from FileA gets compared with every line from FileB1, FileB2, etc. FileB1, FileB2, etc. are in a different input directory. So, two input directories:
1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2 million records - FileB1, FileB2, etc.
How can I have a map that reads a line from FileA in directory input1 and compares that line with each line from input2? What is the best way forward? I have seen plenty of examples that map each record from a single input file and reduce it into an output. thanks

I had a similar problem and solved it by writing a custom InputFormat (see attachment). You should improve the methods ACrossBInputSplit.getLength, ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
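[A minimal sketch of the single-node approach suggested above, under the assumption that the smaller FileA fits on each task node's local disk (shipped there beforehand, e.g. via the distributed cache). The class name and local path are hypothetical: FileA is loaded once in setup(), and FileB streams through map() as the job input.]

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side Cartesian product: FileA is read from local disk once per task,
// FileB is the job input and streams through map().
public class CrossProductMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final List<String> fileALines = new ArrayList<String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Hypothetical local path; in practice this would come from the
    // distributed cache or a shared mount.
    BufferedReader r = new BufferedReader(new FileReader("/local/data/fileA.txt"));
    try {
      String line;
      while ((line = r.readLine()) != null) fileALines.add(line);
    } finally {
      r.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text bLine, Context ctx)
      throws IOException, InterruptedException {
    // Emit every (A line, B line) pair; a real job would apply its
    // domain-specific comparison here instead of emitting everything.
    for (String aLine : fileALines) {
      ctx.write(new Text(aLine), new Text(bLine.toString()));
    }
  }
}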
Re: RE: Using Hadoop in non-typical large scale user-driven environment
As far as replication goes, you should look at a project called Pastry. Apparently some people have used Hadoop MapReduce on top of it. You will need to be clever, however, in how you do your map-reduce, because you probably won't want the job to eat all the users' CPU time. On Dec 2, 2009 5:11 PM, Habermaas, William william.haberm...@fatwire.com wrote: Hadoop isn't going to like losing its datanodes when people shut down their computers. More importantly, when the datanodes are running, your users will be impacted by data replication. Unlike SETI, Hadoop doesn't know when the user's screensaver is running, so it will start doing things when it feels like it. Can someone else comment on whether HOD (Hadoop on Demand) would fit this scenario? Bill -Original Message- From: Maciej Trebacz [mailto: maciej.treb...@gmail.com] Sent: Wednesday,...
Re: New graphic interface for Hadoop - Contains: FileManager, Daemon Admin, Quick Stream Job Setup, etc
The tool looks interesting. You should consider providing the source for it. Is it written in a language that can run on platforms besides Windows? On Nov 17, 2009 10:40 AM, Cubic cubicdes...@gmail.com wrote: Hi list. This tool is a graphic interface for Hadoop. It may improve your productivity quite a bit, especially if you work intensively with files inside HDFS. Note: On my computer it is functional, but it hasn't *yet* been tested on other computers. Download link: http://www.dnabaser.com/hadoop-gui-shell/index.html Please feel free to send feedback. :)
Re: About Distribute Cache
Hi, What you can fit in the distributed cache generally depends on the available disk space on your nodes. With most clusters, 300 MB will not be a problem, but it depends on the cluster and the workload you're processing. On Sat, Nov 14, 2009 at 10:34 PM, 于凤东 fengdon...@gmail.com wrote: I have a 300 MB file that I want to put in the distributed cache, but I want to know: is that a large file for the distributed cache? And normally, what size of files do we put into the DC?
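[For reference, a minimal sketch of putting a file in the distributed cache with the 0.20-era org.apache.hadoop.filecache.DistributedCache API. The HDFS path is hypothetical, and job setup/submission is elided.]

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ship an HDFS file to every task node's local disk before the job runs
    // (the path is hypothetical; resolved against the default filesystem).
    DistributedCache.addCacheFile(new URI("/user/me/lookup-300mb.dat"), conf);
    // ... configure and submit the job with this conf ...
  }
}

// Inside a map or reduce task the local copies can then be located with:
//   Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());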