Re: how to get all different values for each key

2011-08-03 Thread Matthew John
Hey, I feel HashSet is a good way to dedup. To increase overall efficiency you could also look into a Combiner running the same Reducer code. That would ensure less data in the sort-shuffle phase. Regards, Matthew On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang wrote: > hi,harsh > After
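
A minimal sketch of this dedup-with-combiner idea (old 0.20 mapred API; the Text key/value types and the class name are illustrative, not from the thread):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DedupReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        Set<String> seen = new HashSet<String>();
        while (values.hasNext()) {
          String v = values.next().toString();
          if (seen.add(v)) {                 // emit each distinct value once
            output.collect(key, new Text(v));
          }
        }
      }
    }

Registering the same class as the combiner (conf.setCombinerClass(DedupReducer.class)) prunes duplicates on the map side before the shuffle, which is exactly the traffic saving suggested above.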

Global array in OutputFormat

2011-06-13 Thread Matthew John
out under which instance (class hierarchy) to declare such a static buffer which could be accessible by all OutputFormat write streams. Please help me if you've got some idea on this. Thanks, Matthew John

IO benchmark ingesting data into HDFS

2011-06-01 Thread Matthew John
Hi all, I wanted to use an IO benchmark that reads/writes data from/into HDFS using MapReduce. TestDFSIO, I thought, does this. But what I understand is that TestDFSIO merely creates the files in a temp folder in the local filesystem of the TaskTracker nodes. Is this correct? How can such an a
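
For reference, the stock 0.20-era invocation looks roughly like bin/hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100 (then -read for the read pass; the exact jar name varies per build). If memory serves, the benchmark data itself lands under /benchmarks/TestDFSIO in HDFS, and only the TestDFSIO_results.log summary is written to the local filesystem.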

Re: Benchmarks with different workloads

2011-05-31 Thread Matthew John
npei Chen, Archana Ganapathi, Rean Griffith, Randy Katz . "SWIM > - Statistical Workload Injector for MapReduce". Available at: > http://www.eecs.berkeley.edu/~ychen2/SWIM.html > > > ------ Forwarded message -- > > From: Matthew John > > To: common-user

Benchmarks with different workloads

2011-05-31 Thread Matthew John
Hi, I am looking out for Hadoop benchmarks that could characterize the following workloads: 1) IO-intensive workload 2) CPU-intensive workload 3) Mixed (IO + CPU) workloads. Someone please throw some pointers on these!! Thanks, Matthew

Re: Host-address or Hostname

2011-05-12 Thread Matthew John
Is it possible to get a host-address to host-name mapping in the JIP? Someone please help me with this! Thanks, Matthew On Thu, May 12, 2011 at 5:36 PM, Matthew John wrote: > Hi all, > > The String[] that is output by the InputSplit.getLocations() gives the list > of nodes whe

Host-address or Hostname

2011-05-12 Thread Matthew John
Hi all, The String[] that is output by InputSplit.getLocations() gives the list of nodes where the input split resides. But the node detail is represented as either the IP address or the hostname (e.g., an entry in the list could be either 10.72.147.109 or mattHDFS1, a hostname). Is it possib
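
A small sketch of normalizing whatever getLocations() returns, using plain JDK resolution (not from the thread; it assumes forward/reverse DNS or /etc/hosts entries are consistent across the cluster):

    import java.io.IOException;
    import java.net.InetAddress;
    import org.apache.hadoop.mapred.InputSplit;

    static void printHosts(InputSplit split) throws IOException {
      for (String loc : split.getLocations()) {
        // Works whether loc is "10.72.147.109" or "mattHDFS1":
        String host = InetAddress.getByName(loc).getCanonicalHostName();
        System.out.println(loc + " -> " + host);
      }
    }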

Re: Bad connection to FS. command aborted

2011-05-11 Thread Matthew John
d, May 11, 2011 at 7:13 PM, Habermaas, William < william.haberm...@fatwire.com> wrote: > If the hadoop script is picking up a different hadoop-core jar then the > classes that ipc to the NN will be using a different version. > > Bill > > -----Original Message- > From:

Re: Bad connection to FS. command aborted

2011-05-11 Thread Matthew John
version check is enforced. Make sure the new > hadoop-core.jar from your modification is on the classpath used by the > hadoop shell script. > > Bill > > -Original Message- > From: Matthew John [mailto:tmatthewjohn1...@gmail.com] > Sent: Wednesday, May 11, 2011 9:27

Bad connection to FS. command aborted

2011-05-11 Thread Matthew John
Hi all! I have been trying to figure out why I'm getting this error! All that I did was: 1) Use a single-node cluster 2) Made some modifications in the core (in some MapRed modules). Successfully compiled it 3) Tried bin/start-dfs.sh alone. All the required daemons (NN and DN) are up. The NameN

Which datanode serves the data for MR

2011-05-09 Thread Matthew John
Hi all, I wanted to know details such as "In an MR job, which tasktracker (node-level) works on data (inputsplit) from which datanode (node-level)?" Can some logs provide data on it? Or do I need to print this data myself - if yes, what to print and how to print it? Thanks, Matthew
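
One way to collect such a mapping yourself is to log it from inside the task (a sketch against the old JobConf API; where the log line goes is up to you). Note this records where the task ran, which with locality scheduling is usually, but not always, a node holding the block:

    import java.net.InetAddress;
    import org.apache.hadoop.mapred.JobConf;

    // Inside the Mapper:
    public void configure(JobConf conf) {
      String inputFile = conf.get("map.input.file"); // file backing this task's split
      try {
        String host = InetAddress.getLocalHost().getHostName();
        // Ends up in the per-task userlogs:
        System.err.println("split of " + inputFile + " processed on " + host);
      } catch (java.net.UnknownHostException e) {
        // best-effort logging only
      }
    }

The actual block-to-datanode placement can be dumped with bin/hadoop fsck <path> -files -blocks -locations and cross-checked against these log lines.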

When is Map updated for a Job

2011-05-04 Thread Matthew John
. My question is: when is this Map updated? Is it done along with the createNewSplits performed by the JobClient? Where can I find the code for these mappings getting initialized and populated/updated? Thanks, Matthew John

bin/start-dfs/mapred.sh with input slave file

2011-05-04 Thread Matthew John
, Matthew John

Tweak the Daemon start-up

2011-05-03 Thread Matthew John
start-all.sh" ? Suggestions please.. Matthew John

Re: HDFS - MapReduce coupling

2011-05-02 Thread Matthew John
someone kindly give some pointers on this!! On Mon, May 2, 2011 at 12:46 PM, Matthew John wrote: > Any documentations on how the different daemons do the write/read on HDFS > and Local File System (direct), I mean the different protocols used in the > interactions. I basically wanted

Re: HDFS - MapReduce coupling

2011-05-02 Thread Matthew John
ly stored on local disk. > > HDFS is a frail vessel that cannot cope with all the needs. > > On Sun, May 1, 2011 at 11:48 PM, Matthew John >wrote: > > > ... > > 2) Does the Hadoop system utilize the local storage directly for any > > purpose > > (without going through the HDFS) in clustered mode? > > > > >

HDFS - MapReduce coupling

2011-05-01 Thread Matthew John
system utilize the local storage directly for any purpose (without going through the HDFS) in clustered mode? Thanks, Matthew John

Read and Write throughputs via JVM

2011-04-12 Thread Matthew John
additional time stamp outputs (for example - time stamp maybe at the start and end of Map read). Thanks, Matthew John

HDFS Compatibility

2011-04-05 Thread Matthew John
Hi all, Can HDFS run over a raw disk which is mounted over a mount point with no file system? Or does it interact only with a POSIX-compliant file system? Thanks, Matthew

DFSIO benchmark

2011-03-31 Thread Matthew John
Can someone provide pointers/links for DFSIO benchmarks to check the IO performance of HDFS? Thanks, Matthew John

Awareness of Map tasks

2011-03-29 Thread Matthew John
d out ? Or do we have a concept of input split at all (will all the maps start scanning from the start of the input file) ? Please help me with these queries.. Thanks, Matthew John

MR in Local mode

2011-03-17 Thread Matthew John
run 2 mapper tasks simultaneously, will that parallelism be preserved in Local Mode MR too? Matthew John

Hadoop code base splits

2011-03-17 Thread Matthew John
relevant split 2) breakdown of Hbase code into following modules: - HMaster - RegionServers - MapReduce - Any other relevant split Matthew John

Iostat on Hadoop

2011-03-16 Thread Matthew John
Hi all, Can someone give pointers on using Iostat to account for IO overheads (disk read/writes) in a MapReduce job. Matthew John

Re: hadoop installation problem(single-node)

2011-03-02 Thread Matthew John
Hey Manish, I am not very sure if you have got your configurations correct, including the Java path. Can you try re-installing Hadoop following the guidelines given in the following link? That would take care of any possible glitches. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-li

Re: hadoop installation problem(single-node)

2011-03-02 Thread Matthew John
Hey Manish, Are you giving the commands in the Hadoop home directory? If yes, please give "bin/hadoop namenode -format". Don't forget to prepend "bin/" to your commands because all the scripts reside in the bin directory. Matthew On Wed, Mar 2, 2011 at 2:29 PM, Manish Yadav wrote: > Dear Sir/Madam

Re: Cost of bytecode execution in MapReduce

2011-02-17 Thread Matthew John
. But never tried it ). Thanks, Matthew John On Fri, Feb 18, 2011 at 2:54 AM, Ted Yu wrote: > Are you investigating alternative map-reduce framework ? > > Please read: > http://www.craighenderson.co.uk/mapreduce/ > > On Thu, Feb 17, 2011 at 9:45 AM, Matthew John > wrote: >

Re: Cost of bytecode execution in MapReduce

2011-02-17 Thread Matthew John
Hi Ted, Can you provide a link to the same? Not able to find it :( . On Thu, Feb 17, 2011 at 9:54 PM, Ted Yu wrote: > There was a discussion thread about why hadoop was developed in Java. > Please read it. > > On Wed, Feb 16, 2011 at 10:39 PM, Matthew John > wrote: > >&

Mechanism of MapReduce in Hadoop

2011-02-16 Thread Matthew John
pointers to if there's already some work done in this respect. Or please help me with how to proceed with the same analysis if you feel a specific technique/software/development environment has ready plugins to help in this regard. Thanks, Matthew John

Re: Cost of bytecode execution in MapReduce

2011-02-16 Thread Matthew John
get development environment using C++ ? > > On Wed, Feb 16, 2011 at 9:49 PM, Matthew John > wrote: > >> Hi all, >> >> I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs >> any fixed cost of ByteCode execution. And how do the mappers (say of &

Cost of bytecode execution in MapReduce

2011-02-16 Thread Matthew John
Hi all, I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs any fixed cost of bytecode execution. And what do the mappers (say, of the WordCount MR) look like in detail (at the bytecode level)? Any good pointers to this? Thanks, Matthew John

HBase documentation

2011-02-14 Thread Matthew John
Hi guys, can someone send me good documentation on HBase (other than the Hadoop wiki)? I am also looking for a good HBase tutorial. Regards, Matthew

Re: Could I write outputs in multiple directories?

2011-02-13 Thread Matthew John
Hi Junyoung Kim, You can try out MultipleOutputs.addNamedOutput(). The second parameter you pass in is supposed to be the filename to which you are writing the reducer output. Therefore, if your output folder is X (using setOutputPath()), you can try giving "A/output", "B/output", "C/output" in
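
For reference, the documented driver-side setup is roughly the following (old mapred API; the job class and output names are illustrative). Note that a named output is a simple alphanumeric name rather than a path; the files it produces appear under the job's setOutputPath() directory as e.g. outputA-r-00000:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class MultiOutDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultiOutDriver.class);
        FileOutputFormat.setOutputPath(conf, new Path("X")); // base output dir
        MultipleOutputs.addNamedOutput(conf, "outputA",
            TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(conf, "outputB",
            TextOutputFormat.class, Text.class, Text.class);
        // ... set mapper/reducer and submit as usual ...
      }
    }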

some doubts Hadoop MR

2011-02-10 Thread Matthew John
Hi all, I had some doubts regarding the functioning of Hadoop MapReduce : 1) I understand that every MapReduce job is parameterized using an XML file (with all the job configurations). So whenever I set certain parameters using my MR code (say I set splitsize to be 32kb) it does get reflected

Strange byte [] size conflict

2011-02-02 Thread Matthew John
Hi all, I have a BytesWritable key that comes to the mapper. If I give key.getLength(), it returns 32. Then I tried creating a new byte[] array, initializing its size to 32 (byte[] keybytes = new byte[32];), and I tried giving: keybytes = key.getBytes(); now keybytes.length (which should r
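
For context, BytesWritable.getBytes() returns the whole backing buffer, which is often larger than the valid data, so only the first getLength() bytes should be read; assigning keybytes = key.getBytes() also just discards the freshly allocated array. A sketch of trimming to the valid bytes:

    import java.util.Arrays;
    import org.apache.hadoop.io.BytesWritable;

    static byte[] trimmed(BytesWritable key) {
      // Copy only the valid prefix of the backing buffer (here, 32 bytes).
      return Arrays.copyOf(key.getBytes(), key.getLength());
    }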

Map->Reduce->Reduce

2011-01-25 Thread Matthew John
Hi all, I was working on a MapReduce program which does BytesWritable data processing. But currently I am basically running two MapReduce jobs consecutively to get the final output: Input --(MapReduce1)--> Intermediate --(MapReduce2)--> Output. Here I am running MapReduce2 only to sort the in
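
Chaining two jobs in one driver is just two blocking submissions in sequence, feeding the first job's output directory to the second (a sketch; the paths and driver class are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        JobConf job1 = new JobConf(TwoPassDriver.class);
        // ... set mapper/reducer/types for pass 1 ...
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("intermediate"));
        JobClient.runJob(job1);            // blocks until pass 1 finishes

        JobConf job2 = new JobConf(TwoPassDriver.class);
        // ... set mapper/reducer/types for pass 2 (the sort) ...
        FileInputFormat.setInputPaths(job2, new Path("intermediate"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        JobClient.runJob(job2);
      }
    }

If the second job exists only to sort, it can sometimes be folded away: the shuffle of the first job already sorts by key, so an IdentityReducer plus the right output key and comparator may yield sorted output directly.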

Re: help for using mapreduce to run different code?

2010-12-28 Thread Matthew John
Hi Jander, If I understand what you want, you would like to run the map instances of two different MapReduce jobs (so obviously different mapper codes) simultaneously on the same machine. If I am correct, it has got more to do with the number of simultaneous mapper instances setting (I guess its default

hdfs with raid

2010-12-22 Thread Matthew John
Hi all, Got to know about an HDFS-with-RAID implementation from the following documentation: http://wiki.apache.org/hadoop/HDFS-RAID In the documentation, it says you can find the hadoop-*-raid.jar file which contains the libraries to run RAID-HDFS. Where to get this file? Searched a lot, but

Re: InputFormat for a big file

2010-12-17 Thread Matthew John
//So can you guide me to write an InputFormat which splits the file //into multiple Splits. The more mappers you assign, the more input splits in the MapReduce; in effect, the number of input splits is equal to the number of mappers assigned. That should take care of the problem i

Re: Hadoop 0.20.2 with eclipse in windows

2010-12-13 Thread Matthew John
ing up. Have not used Cygwin till now.. Matthew On Tue, Dec 14, 2010 at 9:38 AM, Harsh J wrote: > Hi, > > On Tue, Dec 14, 2010 at 9:22 AM, Matthew John > wrote: > > Hi all, > > > > I have been working with Hadoop 0.20.2 in linux nodes. Now I want to try > the &g

Hadoop 0.20.2 with eclipse in windows

2010-12-13 Thread Matthew John
Hi all, I have been working with Hadoop 0.20.2 on Linux nodes. Now I want to try the same version with Eclipse on a Windows XP machine. Could someone provide a tutorial/guidelines on how to install this setup? Thanks, Matthew

Hadoop Certification Progamme

2010-12-08 Thread Matthew John
Hi all, Is there any valid Hadoop certification available? Something which adds credibility to your Hadoop expertise. Matthew

Tweaking the File write in HDFS

2010-11-14 Thread Matthew John
this procedure to take much less time than copying to HDFS and then performing a CopyToLocal.. Finally I should be able to send this file back to another node using socket communication.. Looking forward to your suggestions!! Thanks, Matthew John

Multiple input not working

2010-10-20 Thread Matthew John
Hi all, I modified a MapReduce code which had only a single input path to accommodate multiple inputs.. The changes I made (in the Driver file): Path FpdbInputPath = new Path(args[0]); Path ClogInputPath = new Path(args[1]); FpdbInputPath = FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job

Multiple Input

2010-10-20 Thread Matthew John
Hi all, I modified a MapReduce code which had only a single input path to accommodate multiple inputs.. The changes I made (in the Driver file): Path FpdbInputPath = new Path(args[0]); Path ClogInputPath = new Path(args[1]); FpdbInputPath = FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job
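
For two input types with different mappers, the usual route (rather than hand-qualifying each path) is MultipleInputs, available in the 0.20 mapred API. A sketch — the driver class and the FpdbMapper/ClogMapper mapper classes are hypothetical stand-ins, not from the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class JoinDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(JoinDriver.class);
        MultipleInputs.addInputPath(job, new Path(args[0]),
            SequenceFileInputFormat.class, FpdbMapper.class);  // first input type
        MultipleInputs.addInputPath(job, new Path(args[1]),
            TextInputFormat.class, ClogMapper.class);          // second input type
        // ... set reducer/types and submit as usual ...
      }
    }

Each path gets its own InputFormat and Mapper, so the two record types can be tagged in their mappers and joined in the reducer.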

Reduce groups

2010-10-19 Thread Matthew John
Hi all, The number of reducer groups in my MapReduce is always the same as the number of records output by the MapReduce. So what I understand is that every record from the Shuffle/Sort is going to a different Reducer.reduce call. How can I change this? My key is BytesWritable and I tried writing my own Compa
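
If every record lands in its own reduce group, the usual suspects are the key's raw comparator and the grouping comparator. A sketch of grouping BytesWritable keys by a fixed prefix (the 8-byte prefix is illustrative; the +4 skips the length header that BytesWritable serializes before its payload):

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.WritableComparator;

    public class PrefixGroupingComparator extends WritableComparator {
      protected PrefixGroupingComparator() {
        super(BytesWritable.class);
      }
      @Override
      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        // Compare only the first 8 payload bytes of each serialized key.
        return compareBytes(b1, s1 + 4, 8, b2, s2 + 4, 8);
      }
    }

Registered with conf.setOutputValueGroupingComparator(PrefixGroupingComparator.class), keys sharing the prefix are then fed to a single reduce() call.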

Multiple Output not working

2010-10-19 Thread Matthew John
Hi all, I was trying a MapReduce module with multiple outputs. My reducer looks like this: public class JohnReducer extends MapReduceBase implements Reducer { private MultipleOutputs mos; public void configure(JobConf conf) { mos = new MultipleOutputs(conf); } /**

Reduce function

2010-10-18 Thread Matthew John
Hi all, I had a small doubt regarding the reduce module. What I understand is that after the shuffle/sort phase, all the records with the same key value go into a reduce function. If that's the case, what is the attribute of the Writable key which ensures that all the keys go to the same reduc
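
For context, two different mechanisms are involved: which reduce task a key goes to is decided by the Partitioner (HashPartitioner by default, so effectively the key's hashCode()), while the grouping into a single reduce() call within that task is decided by the key comparator. The default routing amounts to:

    // HashPartitioner's logic, paraphrased: equal hashCodes => same reduce task.
    int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

So a Writable key must implement hashCode() (and equals()/compareTo()) consistently for all copies of "the same" key to meet in one reduce call.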

Reduce side join

2010-10-18 Thread Matthew John
it fine if I have more than 1 set of input records (primary record followed by the foreign records) in the same reduce phase? For example, will this technique work if I have just one reducer running? Regards, Matthew John

Multiple Input Data Processing using MapReduce

2010-10-14 Thread Matthew John
Hi all, I have recently been working on a task where I need to take in two input files (of different types), compare them, and produce a result from them using some logic. But as I understand, simple MapReduce implementations are for processing a single input type. The closest implementation I could think of simi

doubts

2010-10-13 Thread Matthew John
Hi all, Had some doubts: 1) What happens when a mapper running in node A needs data from a block it doesn't have? (the block might be present in some other node in the cluster) 2) Is the Sort/Shuffle phase just a logical representation of all map outputs together, sorted, right? and again,

Re: Easy Question

2010-10-04 Thread Matthew John
Hi Maha, try the following: go to your /current directory. You will find a file VERSION.. just modify the namespace ID in it with your namespace ID found in the log (in this prev post --> 200395975).. restart Hadoop.. (bin/start-all.sh)... see if all the daemons are up.. regards, Matthew

Incompatible build version

2010-09-27 Thread Matthew John
72.147.206 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.3-dev STARTUP_MSG: build = -r ; compiled by 'matthew' on Mon Sep 27 17:48:33 IST 2010 any way to fix this ?? I want the tasktracker in slave to be up and running . Regards , Matthew John

Input splits not working correctly

2010-09-24 Thread Matthew John
Hi all, I am working on a sort function and it is working perfectly fine with a single map task. When I give 2 map tasks, the entire data is replicated twice (sorted output). When giving 4 map tasks, it gives 4 times the sorted data, and so on. I modified the TeraSort for this. Major

Strange output in modified TeraSort

2010-09-21 Thread Matthew John
140 160 This is strange since the second key should've printed out --> 0001 !!! Notice this happens only with the even-numbered repeating key !! Please guide me on this !! Regards, Matthew John

Re: changing SequenceFile format

2010-09-13 Thread Matthew John
Hey Owen, To sum it up, I should be writing an InputFormat and OutputFormat where I will be defining my RecordReader/Writer and InputSplits. Now, why can't I use the FpMetadata and FpMetaId I implemented as the value and key classes? Would not that solve a lot of problems since I have defined in.readfiel

Re: changing SequenceFile format

2010-09-13 Thread Matthew John
Thanks Owen for your reply! The TeraSort input you have implemented is text type, and the input is line format, whereas I am dealing with a binary sequence file. For my requirement I have created two Writable implementations for the key and value respectively: FpMetaId: key public class FpMeta
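
For reference, a minimal shape for such a fixed-length key class (a sketch, not the poster's actual FpMetaId; the 16-byte size is illustrative):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class FixedKey implements WritableComparable<FixedKey> {
      public static final int SIZE = 16;        // constant key length
      private final byte[] bytes = new byte[SIZE];

      public void readFields(DataInput in) throws IOException {
        in.readFully(bytes);                    // fixed length: no header needed
      }
      public void write(DataOutput out) throws IOException {
        out.write(bytes);
      }
      public int compareTo(FixedKey o) {
        return WritableComparator.compareBytes(bytes, 0, SIZE, o.bytes, 0, SIZE);
      }
      @Override public int hashCode() {
        return WritableComparator.hashBytes(bytes, SIZE);
      }
      @Override public boolean equals(Object o) {
        return (o instanceof FixedKey) && compareTo((FixedKey) o) == 0;
      }
    }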

Re: changing SequenceFile format

2010-09-13 Thread Matthew John
When it comes to Writer, I can see the append and appendRaw methods.. But the next methods (many!) in Reader are confusing! Can you give further info on them? Matthew

changing SequenceFile format

2010-09-13 Thread Matthew John
Hi guys, I wanted to take in a file with input: .. a binary sequence file (key and value length are constant) as input for the Sort (examples). But as I understand, the data in a standard SequenceFile of Hadoop is in the format: . . Where should I modify the code so as to use my input file as

Error: Java heap space

2010-09-09 Thread Matthew John
of maps and reduces both as 1. I am getting the following error: Task Id : attempt_201009082009_0006_m_00_0, Status : FAILED Error: Java heap space. Someone please throw some light on this... thanks, Matthew John

Re: SequenceFile Header

2010-09-08 Thread Matthew John
Hi Edward, Thanks for your reply. My aim is not to generate a SequenceFile. It is to take a file (of a certain format) and sort it. So I guess I should create an input SequenceFile from the original file and feed it to the Sort as input. Now the output will again be in SequenceFile format and I will

SequenceFile Header

2010-09-08 Thread Matthew John
er I think a file split is required ?? It would be great if someone can clarify these doubts. Thanks, Matthew John

Re: Sort with customized input/output !!

2010-09-08 Thread Matthew John
Thanks for the reply Ted!! What I understand is that a SequenceFile will have a header followed by the records in the format RecordLength, KeyLength, Key, Value, with a sync marker coming at some regular interval.. It would be great if someone can take a look at the following.. Q 1) The thing is my
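
For converting a raw fixed-record file into Sort-ready input, the stock writer API is enough; SequenceFile itself adds the header and the periodic sync markers (a sketch; the path and record sizes are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class RawToSeq {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("sort-input.seq"),
            BytesWritable.class, BytesWritable.class);
        try {
          byte[] keyBuf = new byte[32];   // filled from the raw input file
          byte[] valBuf = new byte[64];
          writer.append(new BytesWritable(keyBuf), new BytesWritable(valBuf));
        } finally {
          writer.close();
        }
      }
    }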

Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
Hey guys!, Finally my examples.jar got built :) :) .. It was just a small error -> didn't initialize the package for some of the newly written files :P .. Now I will run the command: bin/hadoop jar hadoop-0.20.2-examples.jar sort -inFormat org.apache.hadoop.mapred.MetafileInputFormat -outForma

Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
which are not updated in the older version of the core.jar. Therefore it gives ClassNotFound. I want to make a new Example jar which depends on the new core.jar. Please guide me on that and let me know if my understanding is wrong. Thanks, Matthew John

Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
Hey Jeff , I gave the command : bin/hadoop jar hadoop-0.20.2-examples.jar sort -libjars ./build/hadoop-0.20.3-dev-core.jar -inFormat org.apache.hadoop.mapred.MetafileInputFormat -outFormat org.apache.hadoop.mapred.MetafileOutputFormat -outKey org.apache.hadoop.io.FpMetaId -outValue org.apache.ha

Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
new classes I've included in the core are not found. I suppose that's because it's using the old hadoop-core.jar. Thanks, Matthew John

Re: How to rebuild Hadoop ??

2010-09-07 Thread Matthew John
Thanks a lot Jeff! The problem is that every time I build (using ant) there is a build folder created, but there is no examples.jar created inside it. I wanted to add some files into the io package and mapred package. So I suppose I should put the files appropriately (inside io and mapred folder r

How to rebuild Hadoop ??

2010-09-07 Thread Matthew John
happened I guess because when I run the program, it gives ClassNotFoundException for the classes I give as parameters in the command. Someone please help me out!! How to modify the core files (incorporate more core io/mapred files) in Hadoop!! Thanks, Matthew John

Sort with customized input/output !!

2010-09-07 Thread Matthew John
ne can offer a helping hand to this noob. Thanks, Matthew John