Hey,
I feel a HashSet is a good way to dedup. To increase the overall efficiency
you could also look into a Combiner running the same Reducer code. That would
push less data through the sort/shuffle phase.
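In the driver that would look roughly like this (old 0.20.x mapred API; the
Dedup* class names below are just placeholders for whatever you have):

  JobConf conf = new JobConf(DedupJob.class);      // placeholder driver class
  conf.setMapperClass(DedupMapper.class);          // emits each record as a key
  conf.setCombinerClass(DedupReducer.class);       // same dedup code, run map-side
  conf.setReducerClass(DedupReducer.class);        // final dedup after the shuffle

Keep in mind the combiner is only a hint to the framework (it may run zero or
more times), so the reducer still has to do the full dedup itself.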
Regards,
Matthew
On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang wrote:
> hi,harsh
> After
out under which instance (class
hierarchy) to declare such a static buffer which could be accessible
by all OutputFormat write streams.
Please help me if you've got some idea on this.
Thanks,
Matthew John
Hi all,
I wanted to use an IO benchmark that reads/writes data from/into HDFS
using MapReduce. I thought TestDFSIO does this. But what I understand is
that TestDFSIO merely creates the files in a temp folder in the local
filesystem of the TaskTracker nodes. Is this correct? How can such an
a
Yanpei Chen, Archana Ganapathi, Rean Griffith, Randy Katz. "SWIM
> - Statistical Workload Injector for MapReduce". Available at:
> http://www.eecs.berkeley.edu/~ychen2/SWIM.html
>
> > ---------- Forwarded message ----------
> > From: Matthew John
> > To: common-user
Hi,
I am looking out for Hadoop benchmarks that could characterize the following
workloads :
1) IO intensive workload
2) CPU intensive workload
3) Mixed (IO + CPU) workloads
Could someone please throw some pointers on these!
Thanks,
Matthew
Is it possible to get a host-address to host-name mapping in the JIP?
Someone please help me with this!
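For now I can do a plain JDK reverse lookup outside the JIP, something like
the sketch below (e.g. for an entry like 10.72.147.109), but I was hoping the
JIP already keeps this mapping somewhere:

  import java.net.InetAddress;
  import java.net.UnknownHostException;

  public class ResolveHost {
      public static void main(String[] args) throws UnknownHostException {
          // prints the hostname if reverse DNS or /etc/hosts can resolve the
          // address, otherwise the textual IP comes back unchanged
          InetAddress addr = InetAddress.getByName("10.72.147.109");
          System.out.println(addr.getCanonicalHostName());
      }
  }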
Thanks,
Matthew
On Thu, May 12, 2011 at 5:36 PM, Matthew John wrote:
> Hi all,
>
> The String[] that is output by the InputSplit.getLocations() gives the list
> of nodes whe
Hi all,
The String[] that is output by the InputSplit.getLocations() gives the list
of nodes where the input split resides.
But the node detail is represented as either the IP address or the hostname
(e.g., an entry in the list could be either 10.72.147.109 or mattHDFS1, the
hostname). Is it possib
d, May 11, 2011 at 7:13 PM, Habermaas, William <
william.haberm...@fatwire.com> wrote:
> If the hadoop script is picking up a different hadoop-core jar then the
> classes that ipc to the NN will be using a different version.
>
> Bill
>
> -----Original Message-----
> From:
version check is enforced. Make sure the new
> hadoop-core.jar from your modification is on the classpath used by the
> hadoop shell script.
>
> Bill
>
> -----Original Message-----
> From: Matthew John [mailto:tmatthewjohn1...@gmail.com]
> Sent: Wednesday, May 11, 2011 9:27
Hi all!
I have been trying to figure out why I'm getting this error!
All I did was:
1) Use a single node cluster
2) Made some modifications in the core (in some MapRed modules).
Successfully compiled it
3) Tried bin/start-dfs.sh alone.
All the required daemons (NN and DN) are up.
The NameN
Hi all,
I wanted to know details such as "In an MR job, which tasktracker
(node-level) works on data (inputsplit) from which datanode (node-level) ?"
Can some logs provide this data? Or do I need to print it myself - if yes,
what should I print and how?
Thanks,
Matthew
My question is: when is this Map updated? Is it done along
with the createNewSplits performed by the JobClient ? Where can I find the
code for these Mappings getting initialized and populated/updated ?
Thanks,
Matthew John
,
Matthew John
start-all.sh" ?
Suggestions please..
Matthew John
Someone kindly give some pointers on this!
On Mon, May 2, 2011 at 12:46 PM, Matthew John wrote:
> Any documentations on how the different daemons do the write/read on HDFS
> and Local File System (direct), I mean the different protocols used in the
> interactions. I basically wanted
ly stored on local disk.
>
> HDFS is a frail vessel that cannot cope with all the needs.
>
> On Sun, May 1, 2011 at 11:48 PM, Matthew John wrote:
>
> > ...
> > 2) Does the Hadoop system utilize the local storage directly for any
> > purpose
> > (without going through the HDFS) in clustered mode?
> >
> >
>
system utilize the local storage directly for any purpose
(without going through the HDFS) in clustered mode?
Thanks,
Matthew John
additional time stamp outputs (for example, a time stamp may be taken
at the start and end of the Map read).
Thanks,
Matthew John
Hi all,
Can HDFS run over a raw disk which is mounted on a mount point with
no file system? Or does it interact only with a POSIX-compliant file
system?
thanks,
Matthew
Can someone provide pointers/links to DFSIO benchmarks to check the IO
performance of HDFS?
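The closest I have found so far is TestDFSIO. If I remember the 0.20.x
invocation right (the test jar name may differ in your build, so treat this
as a sketch):

  bin/hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
  bin/hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 100
  bin/hadoop jar hadoop-0.20.2-test.jar TestDFSIO -clean

As far as I recall it reports throughput and average IO rate per map and
appends the numbers to a local TestDFSIO_results.log.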
Thanks,
Matthew John
d out ?
Or do we have a concept of input split at all (will all the maps start
scanning from the start of the input file) ?
Please help me with these queries..
Thanks,
Matthew John
run 2 mapper tasks
simultaneously, will that parallelism be preserved in Local Mode MR
too?
Matthew John
relevant split
2) breakdown of HBase code into the following modules:
- HMaster
- RegionServers
- MapReduce
- Any other relevant split
Matthew John
Hi all,
Can someone give pointers on using iostat to account for IO overheads
(disk reads/writes) in a MapReduce job?
Matthew John
Hey Manish,
I am not very sure that you have got your configuration correct,
including the Java path. Can you try re-installing Hadoop following the
guidelines given in the link below? That would take
care of any possible glitches.
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-li
Hey Manish,
Are you giving the commands from the Hadoop home directory? If yes, please
give "bin/hadoop namenode -format". Don't forget to prepend "bin/" to
your commands, because all the scripts reside in the bin directory.
Matthew
On Wed, Mar 2, 2011 at 2:29 PM, Manish Yadav wrote:
> Dear Sir/Madam
. But never tried it
).
Thanks,
Matthew John
On Fri, Feb 18, 2011 at 2:54 AM, Ted Yu wrote:
> Are you investigating an alternative map-reduce framework?
>
> Please read:
> http://www.craighenderson.co.uk/mapreduce/
>
> On Thu, Feb 17, 2011 at 9:45 AM, Matthew John
> wrote:
>
Hi Ted,
Can you provide a link to the same? I am not able to find it :(
On Thu, Feb 17, 2011 at 9:54 PM, Ted Yu wrote:
> There was a discussion thread about why hadoop was developed in Java.
> Please read it.
>
> On Wed, Feb 16, 2011 at 10:39 PM, Matthew John
> wrote:
>
pointers to whether there's already some work done in this
respect. Or please help me with how to proceed with the same analysis
if you feel a specific technique/software/development environment has
ready plugins to help in this regard.
thanks,
Matthew John
get development environment using C++ ?
>
> On Wed, Feb 16, 2011 at 9:49 PM, Matthew John
> wrote:
>
>> Hi all,
>>
>> I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs
>> any fixed cost of ByteCode execution. And how do the mappers (say of
Hi all,
I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs
any fixed cost of bytecode execution. And what do the mappers (say of
the WordCount MR) look like in detail (at the bytecode level)? Any good
pointers on this?
Thanks,
Matthew John
Hi guys,
Can someone send me good documentation on HBase (other than the
Hadoop wiki)? I am also looking for a good HBase tutorial.
Regards,
Matthew
Hi Junyoung Kim,
You can try out MultipleOutputs.addNamedOutput(). The second
parameter you pass in is supposed to be the filename to which you are
writing the reducer output. Therefore, if your output folder is X
(using setOutputPath()), you can try giving "A/output", "B/output",
"C/output" in
Hi all,
I had some doubts regarding the functioning of Hadoop MapReduce :
1) I understand that every MapReduce job is parameterized using an XML file
(with all the job configurations). So whenever I set certain parameters
using my MR code (say I set the split size to 32 KB) it does get reflected
Hi all,
I have a BytesWritable key that comes to the mapper.
If I give key.getLength(), it returns 32.
Then I tried creating a new byte[] array, initializing its size to 32
(byte[] keybytes = new byte[32];),
and I tried giving: keybytes = key.getBytes();
Now keybytes.length (which should r
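From what I read, getBytes() hands back the whole internal buffer, which can
be longer than getLength(), and the assignment keybytes = key.getBytes() just
replaces my new array with that buffer. The workaround I am trying inside
map() is a plain copy of the valid prefix:

  byte[] keybytes = new byte[key.getLength()];
  System.arraycopy(key.getBytes(), 0, keybytes, 0, key.getLength());
  // or, on Java 6: byte[] keybytes = Arrays.copyOf(key.getBytes(), key.getLength());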
Hi all,
I was working on a MapReduce program which does BytesWritable
data processing. But currently I am basically running two MapReduce jobs
consecutively to get the final output:
Input --(MapReduce1)--> Intermediate --(MapReduce2)--> Output
Here I am running MapReduce2 only to sort the in
Hi Jander,
If I understand what you want, you would like to run the map instances of two
different MapReduce jobs (so obviously different mapper code) simultaneously on
the same machine. If I am correct, it has got more to do with the number of
simultaneous mapper instances setting (I guess its default
Hi all,
Got to know about an HDFS-with-RAID implementation from the following
documentation:
http://wiki.apache.org/hadoop/HDFS-RAID
In the documentation, it says you can find the hadoop-*-raid.jar file, which
has the libraries to run RAID-HDFS.
Where do I get this file? I searched a lot, but
//So can you guide me to write an InputFormat which splits the file
//into multiple splits?
The more mappers you assign, the more input splits in the
MapReduce job.
In effect, the number of input splits is equal to the number of mappers
assigned.
That should take care of the problem i
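In the driver that is just the numSplits hint (from what I remember it is
only a hint, so FileInputFormat can still end up with a different split
count):

  conf.setNumMapTasks(8);   // passed to InputFormat.getSplits() as numSplits
  // FileInputFormat may still create more (or fewer) splits depending on the
  // HDFS block size and mapred.min.split.size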
ing up. I've not used Cygwin till now.
Matthew
On Tue, Dec 14, 2010 at 9:38 AM, Harsh J wrote:
> Hi,
>
> On Tue, Dec 14, 2010 at 9:22 AM, Matthew John
> wrote:
> > Hi all,
> >
> > I have been working with Hadoop0.20.2 in linux nodes. Now I want to try
> the
Hi all,
I have been working with Hadoop 0.20.2 on Linux nodes. Now I want to try the
same version with Eclipse on a Windows XP machine. Could someone provide a
tutorial/guidelines on how to install this setup?
thanks,
Matthew
Hi all,
Is there any valid Hadoop certification available? Something which adds
credibility to your Hadoop expertise.
Matthew
this procedure to take much less time than copying to
HDFS and then performing a copyToLocal. Finally, I should be able to
send this file back to another node using socket communication.
Looking forward to your suggestions!
Thanks,
Matthew John
Hi all,
I modified a MapReduce code which had only a single input path to accommodate
multiple inputs.
The changes I made (in the Driver file):
Path FpdbInputPath = new Path(args[0]);
Path ClogInputPath = new Path(args[1]);
FpdbInputPath =
FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job
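For context, the two ways I am considering to wire the paths up in the
driver (assuming job is my JobConf; the mapper and input format classes named
here are only placeholders):

  // option 1: both paths go through the same InputFormat/Mapper
  FileInputFormat.setInputPaths(job, FpdbInputPath, ClogInputPath);

  // option 2: a different mapper per path, via the old mapred API
  MultipleInputs.addInputPath(job, FpdbInputPath,
      SequenceFileInputFormat.class, FpdbMapper.class);
  MultipleInputs.addInputPath(job, ClogInputPath,
      TextInputFormat.class, ClogMapper.class);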
Hi all,
The number of reducer groups in my MapReduce job is always the same as the
number of records output by the MapReduce. So what I understand is that every
record from the shuffle/sort is going to a different Reducer.reduce() call.
How can I change this? My key is BytesWritable and I tried writing my own Compa
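For reference, the kind of grouping comparator I was attempting looks roughly
like the sketch below (the 8-byte prefix is only an illustration of the
mechanism, not my real key layout):

  // in the driver
  conf.setOutputValueGroupingComparator(PrefixGroupingComparator.class);

  // compares only the first 8 bytes of each BytesWritable key
  public class PrefixGroupingComparator extends WritableComparator {
      protected PrefixGroupingComparator() {
          super(BytesWritable.class);
      }
      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
          // a serialized BytesWritable starts with a 4-byte length field
          return compareBytes(b1, s1 + 4, 8, b2, s2 + 4, 8);
      }
  }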
Hi all,
I was trying a MapReduce module with multiple outputs.
My reducer looks like this:

public class JohnReducer extends MapReduceBase implements Reducer {

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    /**
Hi all,
I had a small doubt regarding the reduce module. What I understand is that
after the shuffle/sort phase, all the records with the same key value
go into one reduce function call. If that's the case, what is the attribute of the
Writable key which ensures that all the keys go to the same reduc
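From reading around, which reducer a key lands on seems to be decided by the
default HashPartitioner from the key's hashCode(), which I gather does
roughly this:

  // roughly the implementation of org.apache.hadoop.mapred.lib.HashPartitioner
  public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
      public void configure(JobConf job) {}
      public int getPartition(K2 key, V2 value, int numReduceTasks) {
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
  }

So is it the hashCode() plus the key's compareTo() (or a registered raw
comparator) that decides which keys end up in the same reduce() call?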
it fine if I have more than 1 set of input records (primary record
followed by the foreign records) in the same reduce phase.
For example, will this technique work if I have just one reducer running?
Regards,
Matthew John
Hi all,
I have recently been working on a task where I need to take in two input
files (of different types), compare them, and produce a result from them using
some logic. But as I understand it, simple MapReduce implementations are for
processing a single input type. The closest implementation I could think of simi
Hi all,
Had some doubts:
1) What happens when a mapper running on node A needs data from a block it
doesn't have? (The block might be present on some other node in the
cluster.)
2) The Sort/Shuffle phase is just a logical representation of all the map
outputs together, sorted, right? And again,
Hi Maha,
try the following:
go to your data directory's /current folder.
You will find a file called VERSION. Just modify the namespace ID in it with
your namespace ID found in the log (in the previous post --> 200395975),
restart Hadoop
(bin/start-all.sh),
and see if all the daemons are up.
regards,
Matthew
72.147.206
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.3-dev
STARTUP_MSG: build = -r ; compiled by 'matthew' on Mon Sep 27 17:48:33
IST 2010
Any way to fix this? I want the TaskTracker on the slave to be up and
running.
Regards ,
Matthew John
Hi all,
I am working on a sort function and it is working perfectly fine with a
single map task.
When I give 2 map tasks, the entire data is replicated twice (sorted
output). When giving 4 map tasks, it gives 4 times the sorted data, and so
on.
I modified the TeraSort for this.
Major
140
160
This is strange since the second key should've printed out --> 0001!
Notice this happens only with the even-numbered repeating key!
Please guide me on this !!
Regards,
Matthew John
Hey Owen,
To sum it up, I should be writing an InputFormat and OutputFormat where I will be
defining my RecordReader/Writer and InputSplits. Now, why can't I use the
FpMetadata and FpMetaId I implemented as the value and key classes? Would
that not solve a lot of problems, since I have defined in.readfiel
Thanks, Owen, for your reply!
The TeraSort input you have implemented is text type, and the input is line
format, whereas I am dealing with a binary sequence file. For my requirement I
have created two Writable implementations for the key and value respectively:
FpMetaId : key
public class FpMeta
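Simplified, they follow the usual fixed-length WritableComparable pattern,
something like this generic sketch (not my exact code; the 16-byte length is
made up):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.Arrays;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class FixedLengthKey implements WritableComparable<FixedLengthKey> {
      private final byte[] bytes = new byte[16];   // fixed key length

      public void write(DataOutput out) throws IOException {
          out.write(bytes);
      }
      public void readFields(DataInput in) throws IOException {
          in.readFully(bytes);
      }
      public int compareTo(FixedLengthKey other) {
          return WritableComparator.compareBytes(bytes, 0, bytes.length,
                                                 other.bytes, 0, other.bytes.length);
      }
      public int hashCode() {                      // keeps HashPartitioner happy
          return Arrays.hashCode(bytes);
      }
      public boolean equals(Object o) {
          return o instanceof FixedLengthKey
                  && Arrays.equals(bytes, ((FixedLengthKey) o).bytes);
      }
  }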
When it comes to the Writer, I can see the append and appendRaw methods. But
the next methods (there are many!) in the Reader are confusing.
Can you give further info on them?
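The variant I gathered from the examples is next(Writable, Writable), which
fills both objects and returns false at end of file, e.g. (fs, path and conf
set up elsewhere):

  SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
  BytesWritable key = new BytesWritable();
  BytesWritable value = new BytesWritable();
  try {
      while (reader.next(key, value)) {
          // ... use key/value here ...
      }
  } finally {
      reader.close();
  }

But I am not sure when the other next() overloads are meant to be used.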
Matthew
Hi guys,
I wanted to take in a file with input : ..
a binary sequence file (key and value lengths are constant) as input for the
Sort (examples). But as I understand it, the data in a standard SequenceFile of
Hadoop is in the format : . . Where
should I modify the code so as to use my input file as
of maps, reduces both as 1.
I am getting the following error:
Task Id : attempt_201009082009_0006_m_00_0, Status : FAILED
Error: Java heap space
Someone please throw some light on this...
thanks,
Matthew John
Hi Edward,
Thanks for your reply.
My aim is not to generate a SequenceFile. It is to take a file (of a certain
format) and sort it. So I guess I should create an input SequenceFile from
the original file and feed it to the Sort as input. Now the output will
again be in SequenceFile format and I will
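The conversion step I have in mind looks roughly like this (the file names
and the KEY_LEN/VAL_LEN constants are made up; I just read fixed-length
records off the flat file and append them as BytesWritable pairs):

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
          new Path("sort-input.seq"), BytesWritable.class, BytesWritable.class);
  DataInputStream in = new DataInputStream(
          new BufferedInputStream(new FileInputStream("original.bin")));
  byte[] k = new byte[KEY_LEN];
  byte[] v = new byte[VAL_LEN];
  try {
      while (true) {
          in.readFully(k);                      // throws EOFException at the end
          in.readFully(v);
          writer.append(new BytesWritable(k), new BytesWritable(v));
      }
  } catch (EOFException done) {
      // reached the end of the flat input file
  } finally {
      writer.close();
      in.close();
  }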
er I think a file split is required?
It would be great if someone could clarify these doubts.
Thanks,
Matthew John
Thanks for the reply, Ted!
What I understand is that a SequenceFile will have a header followed by the
records in the format RecordLength, KeyLength, Key, Value, with a sync marker
appearing at some regular interval.
It would be great if someone could take a look at the following.
Q 1) The thing is my
Hey guys!
Finally my examples.jar got built :) :) It was just a small error -> I didn't
initialize the package for some of the newly written files :P
Now I will run the command:
bin/hadoop jar hadoop-0.20.2-examples.jar sort -inFormat
org.apache.hadoop.mapred.MetafileInputFormat -outForma
which are not updated in the older version
of the core.jar. Therefore it gives ClassNotFound.
I want to make a new Example jar which depends on the new core.jar. Please
guide me on that and let me know if my understanding is wrong.
Thanks,
Matthew John
Hey Jeff,
I gave the command:
bin/hadoop jar hadoop-0.20.2-examples.jar sort -libjars
./build/hadoop-0.20.3-dev-core.jar -inFormat
org.apache.hadoop.mapred.MetafileInputFormat -outFormat
org.apache.hadoop.mapred.MetafileOutputFormat -outKey
org.apache.hadoop.io.FpMetaId -outValue org.apache.ha
new classes I've
included in the core are not found. I suppose that's because it's using the
old hadoop-core.jar.
Thanks,
Matthew John
Thanks a lot Jeff !
The problem is that every time I build (using ant) there is a build folder
created, but there is no examples.jar created inside it. I wanted to add
some files into the io and mapred packages. So I suppose I should put the
files appropriately (inside the io and mapred folder r
happened, I guess, because when I run the program, it gives a
ClassNotFoundException for the classes I give as parameters in the command.
Someone please help me out! How do I modify the core/ files (incorporate
more core io/mapred files) in Hadoop?
Thanks,
Matthew John
ne can offer a helping hand to this noob.
Thanks,
Matthew John