Hi all,
Has anyone ever used some kind of generic output key for a MapReduce
job?
I have a job running multiple tasks and I want them to be able to use both
Text and IntWritable as output key classes.
Any suggestions ?
Thanks,
Amit.
Hi, Dave:
Thanks for your reply. I am not sure how the EncryptedWritable will work; can
you share more ideas about it?
For example, suppose I have a text file as my raw source file, and I need to
store it in HDFS. If I use any encryption to encrypt the whole file, then there
is no good InputFormat
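One workaround worth considering (my suggestion, not necessarily what Dave meant by EncryptedWritable) is to encrypt record by record rather than the whole file, so a line-oriented InputFormat still sees one record per line and the file stays splittable. A minimal JDK-only sketch, with key handling deliberately simplified to a generated in-memory key:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;
import java.util.Base64;

public class PerRecordCrypto {
    // Placeholder key: a real deployment needs proper key management,
    // which is exactly the hard part discussed elsewhere in this thread.
    static SecretKey key;
    static {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            key = kg.generateKey();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Encrypt one record on its own; prepend the random IV to the ciphertext.
    static String encryptRecord(String record) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(record.getBytes("UTF-8"));
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        // Base64 keeps the ciphertext newline-free, so "one record per line" still holds.
        return Base64.getEncoder().encodeToString(out);
    }

    static String decryptRecord(String encoded) throws Exception {
        byte[] all = Base64.getDecoder().decode(encoded);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key,
               new IvParameterSpec(java.util.Arrays.copyOf(all, 16)));
        byte[] pt = c.doFinal(all, 16, all.length - 16);
        return new String(pt, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String line = "some,raw,record";
        System.out.println(decryptRecord(encryptRecord(line)).equals(line)); // true
    }
}
```

Each output line is independently decryptable, so a mapper can decrypt whatever records land in its split without needing the rest of the file.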
I'm a little confused about splitting and readers.
The data in my application is stored in files of google protocol buffers.
There are multiple protocol buffers per file. There have been a number of
simple ways to put multiple protobufs in a single file, usually involving
writing some kind of
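The usual scheme for "multiple protobufs in a single file" is a length prefix before each message; protobuf's own writeDelimitedTo/parseDelimitedFrom do this with a varint length. A stdlib-only sketch of the idea, using a fixed 4-byte length and raw byte arrays standing in for serialized messages:

```java
import java.io.*;
import java.util.*;

public class LengthPrefixedFraming {
    // Write each message as a 4-byte big-endian length followed by its bytes.
    static byte[] pack(List<byte[]> messages) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (byte[] m : messages) {
            out.writeInt(m.length);
            out.write(m);
        }
        return buf.toByteArray();
    }

    // Read messages back until the stream is exhausted.
    static List<byte[]> unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        List<byte[]> out = new ArrayList<>();
        while (in.available() > 0) {
            byte[] m = new byte[in.readInt()];
            in.readFully(m);
            out.add(m);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> msgs = Arrays.asList("first".getBytes(), "second".getBytes());
        List<byte[]> back = unpack(pack(msgs));
        System.out.println(new String(back.get(1))); // prints "second"
    }
}
```

Note that this framing is why splitting is awkward: a reader dropped into the middle of such a file cannot tell a length prefix from message bytes, which is the problem the thread goes on to discuss.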
Hi, Chris:
Here is my understanding of file splits and data blocks.
HDFS stores your file in multiple data blocks; each block will be 64 MB or
128 MB depending on your settings. Of course, the file can contain multiple
records, so the record boundaries won't line up with the block boundaries
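The mismatch can be sketched with a little arithmetic (the block size and record length below are made-up examples, not anything from a real cluster):

```java
public class BlockBoundaryDemo {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, an assumed setting

    // Which block does the byte at this file offset fall into?
    static long blockIndex(long offset) {
        return offset / BLOCK_SIZE;
    }

    // Does a record starting at 'offset' with 'length' bytes straddle a block boundary?
    static boolean crossesBoundary(long offset, long length) {
        return blockIndex(offset) != blockIndex(offset + length - 1);
    }

    public static void main(String[] args) {
        long recordLength = 1000; // hypothetical record size
        // A record starting just before the boundary spans two blocks. The split
        // containing its start reads it fully; the next split skips the partial
        // bytes at its beginning (this is what LineRecordReader does for text).
        System.out.println(crossesBoundary(BLOCK_SIZE - 100, recordLength)); // true
        System.out.println(crossesBoundary(0, recordLength));                // false
    }
}
```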
Hi Amit,
One way to accomplish this would be to create a custom writable
implementation, TextOrIntWritable, that has fields for both. It could look
something like:
class TextOrIntWritable implements Writable {
    private boolean isText;
    private Text text = new Text();
    private IntWritable integer = new IntWritable();

    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isText);
        if (isText) text.write(out); else integer.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        isText = in.readBoolean();
        if (isText) text.readFields(in); else integer.readFields(in);
    }
}
Why not just write out the int as a numeric string?
On Feb 10, 2013, at 1:07 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Amit,
One way to accomplish this would be to create a custom writable
implementation, TextOrIntWritable, that has fields for both. It could look
something like:
Adding a combiner step first then reduce?
On Feb 8, 2013, at 11:18 PM, Harsh J ha...@cloudera.com wrote:
Hey David,
There's no readily available way to do this today (you may be
interested in MAPREDUCE-199 though) but if your Job scheduler's not
doing multiple-assignments on reduce
Hi,
I have a quick question regarding RAID0 performances vs multiple
dfs.data.dir entries.
Let's say I have 2 x 2TB drives.
I can configure them as 2 separate drives mounted on 2 folders and
assigned to Hadoop using dfs.data.dir. Or I can mount the 2 drives
with RAID0 and assign them as a
One thought comes to mind: disk failure. In the event a disk goes bad,
then with RAID0, you just lost your entire array. With JBOD, you lost
one disk.
-Michael
On Feb 10, 2013, at 8:58 PM, Jean-Marc Spaggiari
jean-m...@spaggiari.org wrote:
Hi,
I have a quick question regarding RAID0
The issue is that my MB does not do JBOD :( Only RAID is possible,
and I have been fighting with it for the last 48h and still cannot make
it work... That's why I'm thinking about using dfs.data.dir instead.
I have 1 drive per node so far and need to move to 2 to reduce I/O wait.
What will be better with
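For what it's worth, listing both mount points in hdfs-site.xml looks like this on Hadoop 1.x, where the property is dfs.data.dir (the paths are made-up examples; each entry should be a file system on its own physical disk):

```xml
<!-- hdfs-site.xml: one entry per physical disk -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data</value>
</property>
```

The DataNode then round-robins block writes across the listed directories, which is what makes this a plausible alternative to RAID0 striping.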
Interesting question. You'd probably need to benchmark to prove it out.
I'm not sure of the exact details of how HDFS stripes data, but it should
compare pretty well to hardware RAID.
Conceptually, HDFS should be able to outperform a RAID solution, since
HDFS knows more about the data being written.
Are you able to create multiple RAID0 volumes? Perhaps you can expose
each disk as its own RAID0 volume...
Not sure why or where LVM comes into the picture here ... LVM is on
the software layer and (hopefully) the RAID/JBOD stuff is at the
hardware layer (and in the case of HDFS, LVM will only
I guess the FairScheduler is doing multiple assignments per heartbeat, hence
the behavior of multiple reduce tasks per node even when they should
otherwise be fully distributed.
Adding a combiner will change this behavior? Could you explain more?
Thanks!
David
From: Michael Segel
@Michael:
I have done some tests comparing RAID0, RAID1, JBOD and LVM on another server.
The results are here:
http://www.spaggiari.org/index.php/hbase/hard-drives-performances
LVM and JBOD were close; that's why I mentioned LVM, since it seems
to be pretty close to JBOD performance-wise and can be
Hi,
I am trying to view the Hadoop source code. I am using Hadoop 1.0.3.
In the Hadoop distribution, only jar files are there.
Please give me some instructions on how to view the source code.
I have seen the contribute-to-Hadoop page
(wiki.apache.org/hadoop/HowToContribute), where something about git is
written. I am
Hi, I am a fresher in Hadoop technologies.
I want to take part in any Hive- or Pig-related projects (I used to be an
Informatica developer) and start off my career. All enterprises need
experienced professionals. I need your suggestions on where to find projects
on big data / Hadoop technologies and willing
The best way is to first learn the concepts thoroughly, and then, if you
like, you can also contribute to Hadoop projects.
After that it is probably better to find some big-data-based projects.
Best,
Mahesh Balija,
CalsoftLabs.
On Mon, Feb 11, 2013 at 10:32 AM, Monkey2Code monkey2c...@gmail.com wrote:
Are there any rules against writing results to Reducer.Context while in the
cleanup() method?
I've got a reducer that is downloading a few tens of millions of images from
a set of URLs fed to it.
To be efficient I run many connections in parallel, but limit connections
per domain and
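A per-domain connection cap like the one described can be built on one semaphore per domain; this is a stdlib sketch of the pattern, not the poster's actual code, and the limit of 4 is an assumed number:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class PerDomainThrottle {
    // Cap concurrent downloads per domain while letting the overall
    // worker pool stay busy on other domains.
    static final int PER_DOMAIN_LIMIT = 4; // assumption, tune per politeness policy
    static final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    static Semaphore forDomain(String domain) {
        return permits.computeIfAbsent(domain, d -> new Semaphore(PER_DOMAIN_LIMIT));
    }

    // Returns true if a download slot for the domain was acquired.
    static boolean tryStart(String domain) {
        return forDomain(domain).tryAcquire();
    }

    // Release the slot once the download completes.
    static void finish(String domain) {
        forDomain(domain).release();
    }

    public static void main(String[] args) {
        for (int i = 0; i < PER_DOMAIN_LIMIT; i++) {
            tryStart("example.com"); // use up all slots for one domain
        }
        System.out.println(tryStart("example.com")); // false: domain saturated
        System.out.println(tryStart("other.org"));   // true: other domains unaffected
    }
}
```

A worker that fails tryStart can requeue the URL and pick up work for a different domain, which keeps a mixed workload of small and huge domains moving.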
Typical best practice is to have a separate file system per spindle. If
you have a RAID-only controller (many are), then you just create one
single-disk RAID0 volume per spindle. The effect is the same.
MapR is unusual in being able to stripe writes over multiple drives organized
into a storage pool, but you will not
All of these suggestions tend to founder on the problem of key management.
What you need to do is
1) define your threats.
2) define your architecture including key management.
3) demonstrate how the architecture defends against the threat environment.
I haven't seen more than a cursory
I tried that approach at first, one domain to one reducer, but it failed me
because my data set has many domains with just a few thousand images,
which is trivial, but we also have quite a few massive domains with 10 million+
images.
One host downloading 10 or 20 million images, while obeying
If I'm running only one MapReduce job then an IntWritable output is fine, but
I'm running several together and some have Text output. I don't want to
maintain duplicate MapReduce jobs for different output types, so I'm trying to
find a more generic solution...
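Hadoop does ship a base class for exactly this, org.apache.hadoop.io.GenericWritable: you subclass it and return the allowed classes (e.g. Text.class, IntWritable.class) from getTypes(), and it writes a type index before delegating serialization. The underlying pattern can be shown Hadoop-free with plain java.io; this sketch stands in for the Writable machinery and is not the Hadoop class itself:

```java
import java.io.*;

// Hadoop-free sketch of the GenericWritable pattern: a one-byte type
// index is written before the payload, so one key class can carry
// either a string or an int.
public class GenericKey {
    static final byte TEXT = 0, INT = 1;
    byte type;
    String text;
    int number;

    void write(DataOutput out) throws IOException {
        out.writeByte(type); // tag first, so the reader knows what follows
        if (type == TEXT) out.writeUTF(text); else out.writeInt(number);
    }

    void readFields(DataInput in) throws IOException {
        type = in.readByte();
        if (type == TEXT) text = in.readUTF(); else number = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        GenericKey k = new GenericKey();
        k.type = INT;
        k.number = 42;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        k.write(new DataOutputStream(buf));
        GenericKey back = new GenericKey();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.number); // prints 42
    }
}
```

One caveat either way: a mixed Text/IntWritable key needs a comparator that defines an ordering across the two types, since the shuffle will sort them together.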
On Mon, Feb 11, 2013 at 3:18 AM, Michael Segel
I submitted MAPREDUCE-4998 with a patch
On 02/07/2013 11:18 AM, Harsh J wrote:
I agree it's a bug if there is a discrepancy between the APIs (we are
supposed to be supporting both for the time being). Please do file a
JIRA with a patch - there shouldn't be any harm in re-passing the
reporter