Re: Reducer-side join example

2010-04-06 Thread Ed Kohlwey
Hi,
Your question has an academic sound, so I'll give it an academic answer ;).
Unfortunately, there are not really any good generalized methods (i.e., cross
joining a large matrix with a large matrix) for doing joins in map-reduce. The
fundamental reason is that in the general case you're comparing everything to
everything, and so for each pair of possible rows you must actually generate
that pair. This means every node ships all its data to every other node, no
matter what (in the general case). I bring this up not because you're looking
to optimize cross joining, but because it demonstrates the point that you will
exploit the characteristics of your data no matter what strategy you choose,
and each strategy will have domain-specific flaws and advantages.

The typical strategy for a reduce-side join is to use Hadoop's sorting
functionality to group rows by their keys, so that the entire data set for a
particular key ends up resident on a single reducer. The key insight is that
you're treating the join as a sorting problem. Yes, this means you risk
producing data sets that overwhelm a single reducer, but that's a trade-off
you accept to reduce the complexity of the original problem.

If the existing join framework in Hadoop (whose javadocs are quite thorough)
is inadequate, you shouldn't be afraid to invent, implement, and test join
strategies that are specific to your domain.
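
For concreteness, here is a minimal, untested sketch of that tagging pattern
(assuming the 0.20 "mapreduce" API, tab-delimited records with the join key in
the first field, and input file names starting with "FileA"/"FileB" - all of
which are assumptions about your setup, not things Hadoop requires):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggedJoin {

  // Tag each record with the relation it came from so the reducer can tell
  // the two sides apart after Hadoop groups everything by key.
  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = file.startsWith("FileA") ? "A" : "B";  // assumed naming
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) return;                      // skip malformed lines
      ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
  }

  // All values for one key arrive at one reducer; buffer the A side and
  // cross it with the B side.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> aSide = new ArrayList<String>();
      List<String> bSide = new ArrayList<String>();
      for (Text v : values) {
        String[] tagged = v.toString().split("\t", 2);
        (tagged[0].equals("A") ? aSide : bSide).add(tagged[1]);
      }
      for (String a : aSide)
        for (String b : bSide)
          ctx.write(key, new Text(a + "\t" + b));
    }
  }
}

Note that the reducer buffers one side in memory, which is exactly the
trade-off mentioned above: a heavily skewed key can still overwhelm a single
reducer.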


On Tue, Apr 6, 2010 at 11:01 AM, M B machac...@gmail.com wrote:

 Thanks, I appreciate the example - what happens if File A and B have many
 more columns (all different data types)?  The logic doesn't seem to work in
 that case - unless we set up the values in the Map function to include the
 file name (maybe the output value is a HashMap or something, which might
 work).

 Also, I was asking to see a reduce-side join as we have other things going
 on in the Mapper and I'm not sure if we can tweak its output (we send output
 to multiple places).  Does anyone have an example using the contrib/DataJoin
 package or something similar?

 thanks

 On Mon, Apr 5, 2010 at 7:03 PM, He Chen airb...@gmail.com wrote:

  For the Map function:
  Input key: default
  input value: File A and File B lines
 
  output key: A, B, C, ... (the first column of the final result)
  output value: 12, 24, Car, 13, Van, SUV, ...
 
  Reduce function:
  take the Map output and do:
  for each key
  {   if the value is an integer
          then save it to array1;
      else save it to array2
  }
  for ith element in array1
      for jth element in array2
          output(key, array1[i] + "\t" + array2[j]);
  done
 
  Hope this helps.
 
 
  On Mon, Apr 5, 2010 at 4:10 PM, M B machac...@gmail.com wrote:
 
   Hi, I need a good java example to get me started with some joining we
  need
   to do, any examples would be appreciated.
  
   File A:
   Field1  Field2
   A       12
   B       13
   C       22
   A       24
  
   File B:
   Field1  Field2   Field3
   A       Car      ...
   B       Truck    ...
   B       SUV      ...
   B       Van      ...
   So, we need to first join File A and B on Field1 (say both are string
   fields).  The result would just be:
   A   12   Car   ...
   A   24   Car   ...
   B   13   Truck   ...
   B   13   SUV   ...
   B   13   Van   ...
   and so on - with all the fields from both files returning.
  
   Once we have that, we sometimes need to then transform it so we have a
   single record per key (Field1):
   A (12,Car) (24,Car)
   B (13,Truck) (13,SUV) (13,Van)
   --however it looks; basically tuples for each key (we'll modify this later
   to return a concatenated set of fields from B, etc.)
  
   At other times, instead of transforming to a single row, we just need to
   modify rows based on values.  So if B.Field2 equals Van, we need to set
   Output.Field2 = whatever, then output to file ...
  
   Are there any good examples of this in native Java (we can't use
   Pig/Hive/etc.)?
  
   thanks.
  
 
 
 
  --
  Best Wishes!
 
 
  --
  Chen He
   PhD. student of CSE Dept.
  Holland Computing Center
  University of Nebraska-Lincoln
  Lincoln NE 68588
 



Re: Help on processing large amount of videos on hadoop

2009-12-22 Thread Ed Kohlwey
Hi Huazhong,
Sounds like an interesting application. Here's a few tips.

1. If the frames are not independent, you should find a way to key them
according to their order before dumping them into Hadoop, so that they can be
sorted as part of your map-reduce task. BTW, the video won't appear split
while it's in HDFS; HDFS does use a block-splitting scheme for replication and
(sort of) job distribution, but this isn't mandatory, and there are plenty of
facilities for customizing this behavior.
2. Is the audio needed? If not, it may make sense to preprocess the data into
(key, image) pairs, where the key is something like what I mentioned above and
the image is a custom writable that handles the data in a common format like
GIF, JPEG, or PNG (a rough sketch of such a writable follows this list).
3. You'll need some in-depth knowledge of your video codec for this. While I'm
not an expert on video codecs, I think many of them compress by specifying key
frames that contain complete data and then representing subsequent frames as
differences from a key frame. You can use a custom InputFormat and InputSplit
to split on frame boundaries, but you will need to be an expert on your codec
to do so. It might be easier to use a framework to pre-transcode the data into
whatever format you will use for your map-reduce jobs.
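
For point 2, a rough, untested sketch of that kind of writable (the class name
ImageWritable and the bare byte-array layout are just an illustration, not
something from the Hadoop API):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ImageWritable implements Writable {
  private byte[] bytes = new byte[0];   // encoded frame, e.g. JPEG or PNG

  public void set(byte[] data) { this.bytes = data; }
  public byte[] get()          { return bytes; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(bytes.length);         // length-prefix the payload
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    bytes = new byte[in.readInt()];
    in.readFully(bytes);
  }
}

A real version would probably also carry format metadata and some sanity
checks.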

On Thu, Dec 17, 2009 at 2:25 PM, Huazhong Ning n...@akiira.com wrote:

 Hi,

 I set up a Hadoop platform and I am going to use it to process a large
 number of videos (each about 500 MB-1 GB in size). But I have run into some
 hard issues:
 1. The frames in each video are not independent, so we may have problems if
 we split the video into blocks and distribute them in HDFS.
 2. The video is compressed, but we hope the input to the map class is video
 frames. In other words, we need to put the codec somewhere.
 3. Our codec (third-party source code) takes a video file name as input. Can
 we get the file name?

 Any suggestions and comments are welcome. Thanks a lot.

 Ning



Re: Can hadoop 0.20.1 programs runs on Amazon Elastic Mapreduce?

2009-12-16 Thread Ed Kohlwey
Last time I checked, EMR only runs 0.18.3. You can use EC2 though, which winds
up being cheaper anyway.

On Wed, Dec 16, 2009 at 8:51 PM, 松柳 lamfeeli...@gmail.com wrote:

 Hi all, I'm wondering whether Amazon has started to support the newest
 stable version of Hadoop, or whether we can still only use 0.18.3?

 Song Liu



Re: multiple file input

2009-12-08 Thread Ed Kohlwey
One important thing to note is that, with cross products, you'll almost
always get better performance if you can fit both files on a single node's
disk rather than distributing the files.
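
A rough, untested sketch of what that looks like with the distributed cache
(assuming the smaller file fits in memory on each node; the file handling and
names here are assumptions, not a recommendation for any particular data set):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrossProductMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final List<String> smallSide = new ArrayList<String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Assumes the driver registered the small file with
    // DistributedCache.addCacheFile(...), so a local copy exists on this node.
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      smallSide.add(line);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text bigLine, Context ctx)
      throws IOException, InterruptedException {
    // Pair every line of the streamed (large) file with every cached line;
    // the actual comparison or filtering would go here instead of a raw write.
    for (String cachedLine : smallSide) {
      ctx.write(new Text(cachedLine), bigLine);
    }
  }
}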

On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 la...@laserxyz.de wrote:



 pmg wrote:
 
  I am evaluating Hadoop for a problem that does a Cartesian product of the
  input from one file of 600K records (FileA) with another set of files
  (FileB1, FileB2, FileB3) containing 2 million lines in total.
 
  Each line from FileA gets compared with every line from FileB1, FileB2,
  etc. FileB1, FileB2, etc. are in a different input directory.
 
  So
 
  Two input directories
 
  1. input1 directory with a single file of 600K records - FileA
  2. input2 directory segmented into different files with 2 million records -
  FileB1, FileB2, etc.
 
  How can I have a map that reads a line from FileA in directory input1
  and compares that line with each line from input2?
 
  What is the best way forward? I have seen plenty of examples that map
  each record from a single input file and reduce into an output.
 
  thanks
 


 I had a similar problem and solved it by writing a custom InputFormat (see
 attachment). You should improve the methods ACrossBInputSplit.getLength,
 ACrossBRecordReader.getPos, and ACrossBRecordReader.getProgress.
 --
 View this message in context:
 http://old.nabble.com/multiple-file-input-tp24095358p26694569.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: RE: Using Hadoop in non-typical large scale user-driven environment

2009-12-02 Thread Ed Kohlwey
As far as replication goes, you should look at a project called Pastry.
Apparently some people have used Hadoop MapReduce on top of it. You will need
to be clever, however, in how you do your map-reduce, because you probably
won't want the job to eat all of the users' CPU time.

On Dec 2, 2009 5:11 PM, Habermaas, William william.haberm...@fatwire.com
wrote:

Hadoop isn't going to like losing its datanodes when people shut down their
computers.
More importantly, when the datanodes are running, your users will be impacted
by data replication. Unlike SETI@home, Hadoop doesn't know when the user's
screensaver is running, so it will start doing things whenever it feels like
it.

Can someone else comment on whether HOD (hadoop-on-demand) would fit this
scenario?
Bill

-Original Message- From: Maciej Trebacz [mailto:
maciej.treb...@gmail.com] Sent: Wednesday,...


Re: New graphic interface for Hadoop - Contains: FileManager, Daemon Admin, Quick Stream Job Setup, etc

2009-11-18 Thread Ed Kohlwey
The tool looks interesting. You should consider providing the source for it.
Is it written in a language that can run on platforms besides Windows?

On Nov 17, 2009 10:40 AM, Cubic cubicdes...@gmail.com wrote:

Hi list.
This tool is a graphic interface for Hadoop.
It may improve your productivity quite a bit, especially if you work
intensively with files inside HDFS.

Note:
On my computer it is functional, but it hasn't yet been tested on other
computers.

Download link: http://www.dnabaser.com/hadoop-gui-shell/index.html

Please feel free to send feedback.
:)


Re: About Distribute Cache

2009-11-15 Thread Ed Kohlwey
Hi,
What you can fit in the distributed cache generally depends on the available
disk space on your nodes. For most clusters, 300 MB will not be a problem, but
it depends on the cluster and the workload you're processing.
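
For reference, registering a file with the cache from the driver is just a
couple of calls; an untested fragment (the path and job name are made up):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Copy an HDFS file to each task node's local disk before the tasks run.
    DistributedCache.addCacheFile(new URI("/user/me/lookup-300mb.dat"), conf);
    Job job = new Job(conf, "uses-distributed-cache");
    // ...remaining job setup as usual; tasks read the local copy back with
    // DistributedCache.getLocalCacheFiles(conf).
  }
}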

On Sat, Nov 14, 2009 at 10:34 PM, 于凤东 fengdon...@gmail.com wrote:

 I have a 300 MB file that I want to put into the distributed cache, but I
 want to know: is that too large a file for the distributed cache? And
 normally, what size files do we put into the DC?