Generic output key class

2013-02-10 Thread Amit Sela
Hi all, Has anyone ever used some kind of generic output key for a MapReduce job? I have a job running multiple tasks and I want them to be able to use both Text and IntWritable as output key classes. Any suggestions? Thanks, Amit.

RE: Question related to Decompressor interface

2013-02-10 Thread java8964 java8964
Hi, Dave: Thanks for your reply. I am not sure how the EncryptedWritable would work; can you share more ideas about it? For example, say I have a text file as my source raw file and I need to store it in HDFS. If I use any encryption to encrypt the whole file, then there is no good InputFormat
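The digest cuts the question off here, but the idea Dave floated is worth sketching. A minimal, hypothetical EncryptedWritable (all details below are my assumptions, not Dave's actual design) would encrypt each record's payload inside write()/readFields(), so HDFS only ever stores ciphertext while any InputFormat can still frame the records:

    // Hypothetical sketch: per-record encryption inside a Writable.
    // Key distribution is deliberately out of scope (see Ted Dunning's
    // reply later in this digest); AES in the default ECB mode is shown
    // only for brevity and is not a recommendation.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.hadoop.io.Writable;

    public class EncryptedWritable implements Writable {
        private byte[] payload = new byte[0];   // ciphertext only

        public void set(byte[] plaintext, SecretKeySpec key) throws Exception {
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.ENCRYPT_MODE, key);
            payload = c.doFinal(plaintext);
        }

        public byte[] get(SecretKeySpec key) throws Exception {
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.DECRYPT_MODE, key);
            return c.doFinal(payload);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(payload.length);        // length prefix, then ciphertext
            out.write(payload);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }
    }

Stored inside a SequenceFile, such records stay splittable because the framing and the file's sync markers remain unencrypted; only the payload is opaque.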

Confused about splitting

2013-02-10 Thread Christopher Piggott
I'm a little confused about splitting and readers. The data in my application is stored in files of Google protocol buffers. There are multiple protocol buffers per file. There have been a number of simple ways to put multiple protobufs in a single file, usually involving writing some kind of
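The preview is truncated, but the framing it alludes to is almost certainly length-prefixing, which protobuf's Java library supports out of the box via writeDelimitedTo() / parseDelimitedFrom(). A small sketch (MyRecord stands in for any protoc-generated message class):

    import java.io.*;

    public class DelimitedProtoIO {
        // Write several messages to one file, each prefixed with its varint length.
        static void writeAll(Iterable<MyRecord> records, File f) throws IOException {
            try (FileOutputStream out = new FileOutputStream(f)) {
                for (MyRecord r : records) {
                    r.writeDelimitedTo(out);
                }
            }
        }

        // Read them back until end of file.
        static void readAll(File f) throws IOException {
            try (FileInputStream in = new FileInputStream(f)) {
                MyRecord r;
                while ((r = MyRecord.parseDelimitedFrom(in)) != null) {
                    System.out.println(r);
                }
            }
        }
    }

Note that plain length-prefixing is not seekable: a reader dropped into the middle of the file cannot find the next record boundary, which is exactly why splitting such files is awkward and why container formats with sync markers (e.g. a SequenceFile holding the protobuf bytes as values) are a common workaround.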

RE: Confused about splitting

2013-02-10 Thread java8964 java8964
Hi, Chris: Here is my understanding of file splits and data blocks. HDFS will store your file in multiple data blocks; each block will be 64MB or 128MB depending on your settings. Of course, a file can contain multiple records, so the record boundaries won't match the block boundaries
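To make the convention concrete: Hadoop's line-oriented readers (e.g. LineRecordReader) handle records that straddle a split boundary by skipping the partial first record and reading one record past the split's nominal end. A minimal plain-Java sketch of that rule (illustrative, not Hadoop's actual classes):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SplitBoundaryDemo {
        // Print the newline-delimited records "owned" by the byte range [start, end].
        static void readSplit(RandomAccessFile f, long start, long end) throws IOException {
            f.seek(start);
            if (start != 0) {
                f.readLine();                 // discard the partial first record
            }
            long pos = f.getFilePointer();
            while (pos <= end) {              // a record starting at <= end is ours
                String record = f.readLine(); // may read past 'end' once
                if (record == null) break;    // end of file
                System.out.println(record);
                pos = f.getFilePointer();
            }
        }
    }

Because every split applies the same two rules, each record is read exactly once no matter where the block boundaries fall.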

Re: Generic output key class

2013-02-10 Thread Sandy Ryza
Hi Amit, One way to accomplish this would be to create a custom writable implementation, TextOrIntWritable, that has fields for both. It could look something like: class TextOrIntWritable implements Writable { private boolean isText; private Text text; private IntWritable integer; void
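The digest truncates Sandy's snippet; a compilable version of the same idea could look like this (the tag-then-payload serialization is my reconstruction, not necessarily Sandy's):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class TextOrIntWritable implements Writable {
        private boolean isText;
        private Text text = new Text();
        private IntWritable integer = new IntWritable();

        public void set(Text t)        { isText = true;  text = t; }
        public void set(IntWritable i) { isText = false; integer = i; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeBoolean(isText);               // write the tag, then the payload
            if (isText) { text.write(out); } else { integer.write(out); }
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            isText = in.readBoolean();
            if (isText) { text.readFields(in); } else { integer.readFields(in); }
        }
    }

One caveat: to serve as a map output key it would also have to implement WritableComparable, since keys are sorted during the shuffle.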

Re: Generic output key class

2013-02-10 Thread Michael Segel
Why not just write out the int as a numeric string? On Feb 10, 2013, at 1:07 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Amit, One way to accomplish this would be to create a custom writable implementation, TextOrIntWritable, that has fields for both. It could look something like:

Re: How can I limit reducers to one-per-node?

2013-02-10 Thread Michael Segel
Adding a combiner step first, then reduce? On Feb 8, 2013, at 11:18 PM, Harsh J ha...@cloudera.com wrote: Hey David, There's no readily available way to do this today (you may be interested in MAPREDUCE-199 though) but if your Job scheduler's not doing multiple-assignments on reduce

Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Jean-Marc Spaggiari
Hi, I have a quick question regarding RAID0 performance vs multiple dfs.data.dir entries. Let's say I have 2 x 2TB drives. I can configure them as 2 separate drives mounted on 2 folders and assigned to Hadoop using dfs.data.dir. Or I can mount the 2 drives with RAID0 and assign them as a
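For reference, the multiple-directories option Jean-Marc describes would look like this in hdfs-site.xml (paths are illustrative); the DataNode round-robins new blocks across the listed directories:

    <!-- hdfs-site.xml: one entry per physical drive -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data</value>
    </property>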

Re: Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Michael Katzenellenbogen
One thought comes to mind: disk failure. In the event a disk goes bad, with RAID0 you just lost your entire array; with JBOD, you lost one disk. -Michael On Feb 10, 2013, at 8:58 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi, I have a quick question regarding RAID0

Re: Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Jean-Marc Spaggiari
The issue is that my motherboard does not do JBOD :( Only RAID is possible, and I've been fighting with it for the last 48h and am still not able to make it work... That's why I'm thinking about using dfs.data.dir instead. I have 1 drive per node so far and need to move to 2 to reduce WIO. What will be better with

Re: Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Chris Embree
Interesting question. You'd probably need to benchmark to prove it out. I'm not sure of the exact details of how HDFS stripes data, but it should compare pretty well to hardware RAID. Conceptually, HDFS should be able to outperform a RAID solution, since HDFS knows more about the data being written.

Re: Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Michael Katzenellenbogen
Are you able to create multiple RAID0 volumes? Perhaps you can expose each disk as its own RAID0 volume... Not sure why or where LVM comes into the picture here ... LVM is on the software layer and (hopefully) the RAID/JBOD stuff is at the hardware layer (and in the case of HDFS, LVM will only

RE: How can I limit reducers to one-per-node?

2013-02-10 Thread David Parks
I guess the FairScheduler is doing multiple assignments per heartbeat, hence the behavior of multiple reduce tasks per node even when they should otherwise be fully distributed. Will adding a combiner change this behavior? Could you explain more? Thanks! David From: Michael Segel
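If I recall the Hadoop 1.x Fair Scheduler correctly, the knob behind David's guess is mapred.fairscheduler.assignmultiple, which controls whether the scheduler hands out more than one task per TaskTracker heartbeat; treat the snippet below as a sketch to verify against your version's docs:

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>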

Re: Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Jean-Marc Spaggiari
@Michael: I have done some tests comparing RAID0, RAID1, JBOD and LVM on another server. The results are here: http://www.spaggiari.org/index.php/hbase/hard-drives-performances LVM and JBOD were close; that's why I mentioned LVM, since it seems to be pretty close to JBOD performance-wise and can be

HOW TO WORK WITH HADOOP SOURCE CODE

2013-02-10 Thread Dibyendu Karmakar
Hi, I am trying to view the Hadoop source code. I am using Hadoop 1.0.3. In the Hadoop distribution, only jar files are there. Please give me some instructions on how to view the source code. I have seen the contribute to hadoop page ( wiki.apache.org/hadoop/HowToContribute ) where something about git is written. I am
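For what it's worth, the Hadoop 1.0.3 release tarball already ships the Java sources (under src/, e.g. src/core and src/mapred, if memory serves), so no git checkout is strictly needed; alternatively, the matching tag can be checked out from Apache svn (URL assuming the svn layout of the time):

    # the sources are inside the release tarball
    tar -xzf hadoop-1.0.3.tar.gz
    ls hadoop-1.0.3/src

    # or check out the release tag
    svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.3/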

fresher in hadoop

2013-02-10 Thread Monkey2Code
Hi, I am a fresher in Hadoop technologies. I want to take part in any Hive- or Pig-related projects (I used to be an Informatica developer) and start off my career. All enterprises need experienced professionals; I need your suggestions on where to find projects on big data / Hadoop technologies, and am willing

Re: fresher in hadoop

2013-02-10 Thread Mahesh Balija
The best way is to first learn the concepts thoroughly, and then, if you like, you can also contribute to the Hadoop projects. After that, it is probably better to find some BigData-based projects. Best, Mahesh Balija, CalsoftLabs. On Mon, Feb 11, 2013 at 10:32 AM, Monkey2Code monkey2c...@gmail.com wrote:

File does not exist on part-r-00000 file after reducer runs

2013-02-10 Thread David Parks
Are there any rules against writing results to Reducer.Context while in the cleanup() method? I've got a reducer that is downloading a few tens of millions of images from a set of URLs fed to it. To be efficient I run many connections in parallel, but limit connections per domain and
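As far as the new (org.apache.hadoop.mapreduce) API goes, writing from cleanup() is legal: Reducer.run() invokes cleanup() before the framework closes the RecordWriter and commits the task output. A minimal sketch (class and buffer names are illustrative, not David's code):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class UrlDownloadReducer extends Reducer<Text, Text, Text, Text> {
        private final StringBuilder pending = new StringBuilder(); // queued results

        @Override
        protected void reduce(Text key, Iterable<Text> urls, Context ctx) {
            pending.append(key).append('\n');   // queue work instead of writing now
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            // Emitting here is fine: the output is not committed until after
            // cleanup() returns and the RecordWriter is closed.
            ctx.write(new Text("summary"), new Text(pending.toString()));
        }
    }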

Re: Multiple dfs.data.dir vs RAID0

2013-02-10 Thread Ted Dunning
Typical best practice is to have a separate file system per spindle. If you have a RAID-only controller (many are), then you just create one RAID volume per spindle. The effect is the same. MapR is unusual in being able to stripe writes over multiple drives organized into a storage pool, but you will not

Re: Question related to Decompressor interface

2013-02-10 Thread Ted Dunning
All of these suggestions tend to founder on the problem of key management. What you need to do is: 1) define your threats; 2) define your architecture, including key management; 3) demonstrate how the architecture defends against the threat environment. I haven't seen more than a cursory

RE: How can I limit reducers to one-per-node?

2013-02-10 Thread David Parks
I tried that approach at first, one domain to one reducer, but it failed me because my data set has many domains with just a few thousand images, which is trivial, but we also have quite a few massive domains with 10 million+ images. One host downloading 10 or 20 million images, while obeying

Re: Generic output key class

2013-02-10 Thread Amit Sela
If I'm running only one MapReduce job, then IntWritable output is OK, but if I'm running several together and some have Text output, I don't want to duplicate MapReduce jobs for different output types; I'm trying to find a more generic solution... On Mon, Feb 11, 2013 at 3:18 AM, Michael Segel
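For what it's worth, Hadoop ships org.apache.hadoop.io.GenericWritable for roughly this case: subclass it, declare the allowed types, and the wrapper serializes a type tag alongside the wrapped instance. A sketch:

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class TextOrIntKey extends GenericWritable {
        @SuppressWarnings("unchecked")
        private static final Class<? extends Writable>[] TYPES =
                new Class[] { Text.class, IntWritable.class };

        @Override
        protected Class<? extends Writable>[] getTypes() {
            return TYPES;
        }
    }

Usage would be key.set(new Text("foo")) or key.set(new IntWritable(42)). Note that GenericWritable is not WritableComparable, so as a map output key it would still need a registered comparator.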

Re: why does OldCombinerRunner pass Reporter.NULL to the combiner instead of the real reporter?

2013-02-10 Thread Jim Donofrio
I submitted MAPREDUCE-4998 with a patch. On 02/07/2013 11:18 AM, Harsh J wrote: I agree it's a bug if there is a discrepancy between the APIs (we are supposed to be supporting both for the time being). Please do file a JIRA with a patch - there shouldn't be any harm in re-passing the reporter