Re: how to get all different values for each key

2011-08-03 Thread Matthew John
Hey,

I feel a HashSet is a good way to dedup. To increase the overall efficiency
you could also look into a Combiner running the same dedup logic. That would
mean less data in the sort/shuffle phase.
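
For illustration, a rough sketch of what that could look like with the new mapreduce API (hypothetical class name, untested; note the combiner has to re-emit plain (LongWritable, LongWritable) pairs so its output types match the map output, leaving the final packing into LongsWritable to the reducer):

import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Map-side dedup: for each key, emit every distinct value once.
public class DedupCombiner
    extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

  private final HashSet<Long> seen = new HashSet<Long>();
  private final LongWritable out = new LongWritable();

  @Override
  public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    seen.clear();
    for (LongWritable v : values) {
      if (seen.add(v.get())) {       // true only the first time this value is seen
        out.set(v.get());
        context.write(key, out);     // re-emit (key, value) so types match the map output
      }
    }
  }
}

It would be wired in with job.setCombinerClass(DedupCombiner.class) in the driver.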

Regards,
Matthew

On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang wangjx...@gmail.com wrote:

 hi,harsh
 After the map, I can get all values for one key, but I want to dedup these
 values and get only the unique ones. Right now I do it as in the code below.

 I think the following code is not efficient (it uses a HashSet to dedup).
 Thanks:)

 private static class MyReducer extends
     Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {

   private final HashSet<Long> uids = new HashSet<Long>();
   private final LongsWritable unique_uids = new LongsWritable();

   public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
       throws IOException, InterruptedException {
     uids.clear();
     // collect the distinct values for this key
     for (LongWritable v : values) {
       uids.add(v.get());
     }
     // copy the set into a long[] and emit it as a single LongsWritable
     long[] l = new long[uids.size()];
     int i = 0;
     for (long uid : uids) {
       l[i++] = uid;
     }
     unique_uids.set(l);
     context.write(key, unique_uids);
   }
 }


 2011/8/3 Harsh J ha...@cloudera.com

 Use MapReduce :)

 If map output: (key, value)
 Then reduce input becomes: (key, [iterator of values across all maps
 with (key, value)])

 I believe this is very similar to the wordcount example, but minus the
 summing. For a given key, you get all the values that carry that key
 in the reducer. Have you tried to run a simple program to achieve this
 before asking? Or is something specifically not working?

 On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang wangjx...@gmail.com wrote:
  HI,
  I have many (key, value) pairs now, and want to get all the different
  values for each key. Which way is efficient for this work?
 
such as input : 1,2 1,3 1,4 1,3 2,1 2,2
output: 1,2/3/4 2,1/2
 
Thanks!
 
  walter
 



 --
 Harsh J





Global array in OutputFormat

2011-06-13 Thread Matthew John
Hi Guys

I intend to record the write pattern of a job using the following
record: (timestamp, size of buffer written). In order to obtain
this, I was thinking of maintaining a global buffer
(Collection<String>) and appending to it whenever write() is
called via the OutputFormat class.

But I am not really able to figure out under which class (in the class
hierarchy) to declare such a static buffer so that it is accessible
to all OutputFormat write streams.

Please help me if you've got some idea on this.
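
One rough sketch of the kind of holder I have in mind (hypothetical names, not tested; each RecordWriter.write() would call WriteTrace.record() before doing the actual write):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

// Hypothetical holder for the write trace, shared by every record writer
// created inside one task JVM.
public class WriteTrace {

  // One entry per write() call: "timestamp,bytesWritten"
  public static final Collection<String> RECORDS =
      Collections.synchronizedList(new ArrayList<String>());

  public static void record(long bytesWritten) {
    RECORDS.add(System.currentTimeMillis() + "," + bytesWritten);
  }
}

The catch I can see is that a static field is only global within a single task JVM, so the per-task traces would still have to be merged across tasktrackers afterwards.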

Thanks,
Matthew John


Re: Benchmarks with different workloads

2011-06-01 Thread Matthew John
I am looking for a compute-intensive benchmark (CPU usage > 60%) for my
hadoop cluster. If there is something readily available, that would be great.

Thanks,
Matthew

On Tue, May 31, 2011 at 8:30 PM, Cristina Abad cristina.a...@gmail.comwrote:

 You could try SWIM [1].

 -Cristina

 [1] Yanpei Chen, Archana Ganapathi, Rean Griffith, Randy Katz . SWIM
 - Statistical Workload Injector for MapReduce. Available at:
 http://www.eecs.berkeley.edu/~ychen2/SWIM.html

  -- Forwarded message --
  From: Matthew John tmatthewjohn1...@gmail.com
  To: common-user common-user@hadoop.apache.org
  Date: Tue, 31 May 2011 20:01:25 +0530
  Subject: Benchmarks with different workloads
  Hi ,
 
  I am looking out for Hadoop benchmarks that could characterize the
 following
  workloads :
 
  1) IO intensive workload
 
  2) CPU intensive workload
 
  3) Mixed (IO + CPU) workloads
 
  Someone please throw some pointers on these!
 
  Thanks,
  Matthew
 
 



IO benchmark ingesting data into HDFS

2011-06-01 Thread Matthew John
Hi all,

I wanted to use an IO benchmark that reads/writes data from/into HDFS
using MapReduce. TestDFSIO, I thought, does this. But what I understand is
that TestDFSIO merely creates the files in a temp folder in the local
filesystem of the TaskTracker nodes. Is this correct? How can such an
approach test the IOPS delivered by an IO-intensive MapReduce workload?

Matthew


Benchmarks with different workloads

2011-05-31 Thread Matthew John
Hi ,

I am looking out for Hadoop benchmarks that could characterize the following
workloads :

1) IO intensive workload

2) CPU intensive workload

3) Mixed (IO + CPU) workloads

Someone please throw some pointers on these!

Thanks,
Matthew


Host-address or Hostname

2011-05-12 Thread Matthew John
Hi all,

The String[] returned by InputSplit.getLocations() gives the list
of nodes where the input split resides.
But each node is represented either as the IP address or the hostname
(for example, an entry in the list could be either 10.72.147.109 or mattHDFS1).
Is it possible to make this consistent? I am trying to do some
work by parsing an ID number embedded in the hostname, and this mixed
representation is giving me a lot of problems.

How can I resolve this?
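
One workaround I am considering is normalising whatever getLocations() returns through a reverse lookup, something like this (a sketch only, hypothetical class name):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Normalise whatever InputSplit.getLocations() returns (IP address or hostname)
// to a canonical hostname, so the embedded ID can always be parsed.
public class LocationNormalizer {

  public static String toHostname(String location) {
    try {
      return InetAddress.getByName(location).getCanonicalHostName();
    } catch (UnknownHostException e) {
      return location;   // fall back to the raw value if it cannot be resolved
    }
  }
}

It does cost an extra DNS lookup per location, though, so a way to get consistent values out of the framework itself would still be nicer.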

Thanks,
Matthew


Re: Host-address or Hostname

2011-05-12 Thread Matthew John
Is it possible to get a Host-address to Host-name mapping in the JIP ?
Someone please help me with this!

Thanks,
Matthew

On Thu, May 12, 2011 at 5:36 PM, Matthew John tmatthewjohn1...@gmail.comwrote:

 Hi all,

 The String[] that is output by the InputSplit.getLocations() gives the list
 of nodes where the input split resides.
 But the node detail is either represented as the ip-address or the hostname
 (for eg - an entry in the list could be either 10.72.147.109 or mattHDFS1
 (hostname). Is it possible to make this consistent. I am trying to do some
 work by parsing an ID number embedded in the Hostname and this mixed
 representation is giving me hell lot of problems.

 How to resolve this ?

 Thanks,
 Matthew



Bad connection to FS. command aborted

2011-05-11 Thread Matthew John
Hi all!

I have been trying to figure out why I'm getting this error!

All that I did was :
1) Use a single node cluster
2) Made some modifications in the core (in some MapRed modules).
Successfully compiled it
3) Tried bin/start-dfs.sh alone.

All the required daemons (NN and DN) are up.
The NameNode and DataNode logs aren't showing any errors/exceptions.

Only interesting thing I found was :
*WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch
from 10.72.147.109:40048 got version 94 expected version 3  *
in the NameNode logs.

Someone please help me out of this!

Matthew


Re: Bad connection to FS. command aborted

2011-05-11 Thread Matthew John
I did an 'ant jar' after modifying two files in the mapred module. This, from
what I understand, creates a hadoop-*-core.jar in the build folder. Now I
assume that will be used henceforth for any execution. So how can this be a
problem if I am running a single-node cluster? A version mismatch with whom?

On Wed, May 11, 2011 at 7:07 PM, Habermaas, William 
william.haberm...@fatwire.com wrote:

 The Hadoop IPCs are version specific.  That is done to prevent an older
 version from talking to a newer one.  Even if nothing has changed in the
 internal protocols the version check is enforced.  Make sure the new
 hadoop-core.jar from your modification is on the classpath used by the
 hadoop shell script.

 Bill

 -Original Message-
 From: Matthew John [mailto:tmatthewjohn1...@gmail.com]
 Sent: Wednesday, May 11, 2011 9:27 AM
 To: common-user
 Subject: Bad connection to FS. command aborted

 Hi all!

 I have been trying to figure out why I m getting this error!

 All that I did was :
 1) Use a single node cluster
 2) Made some modifications in the core (in some MapRed modules).
 Successfully compiled it
 3) Tried bin/start-dfs.sh alone.

 All the required daemons (NN and DN) are up.
 The NameNode and DataNode logs are nt showing any errors/exceptions.

 Only interesting thing I found was :
 *WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch
 from 10.72.147.109:40048 got version 94 expected version 3  *
 in the NameNode logs.

 Someone please help me out of this!

 Matthew



Re: Bad connection to FS. command aborted

2011-05-11 Thread Matthew John
 org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 54310: starting
2011-05-11 19:34:36,622 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54310: starting
2011-05-11 19:34:36,631 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 54310: starting
2011-05-11 19:34:36,639 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 54310: starting
2011-05-11 19:34:36,640 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 54310: starting
2011-05-11 19:34:36,655 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 54310: starting
2011-05-11 19:34:36,656 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 54310: starting
2011-05-11 19:34:36,658 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 54310: starting
2011-05-11 19:34:36,658 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 8 on 54310: starting
2011-05-11 19:34:36,669 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54310: starting
2011-05-11 19:37:36,548 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from
10.72.147.109:50010 storage
DS-1515207802-10.72.147.109-50010-1305118592183
2011-05-11 19:37:36,551 INFO org.apache.hadoop.net.NetworkTopology: Adding a
new node: /default-rack/10.72.147.109:50010

Thanks,
Matthew

On Wed, May 11, 2011 at 7:13 PM, Habermaas, William 
william.haberm...@fatwire.com wrote:

 If the hadoop script is picking up a different hadoop-core jar then the
 classes that ipc to the NN will be using a different version.

 Bill

 -Original Message-
 From: Matthew John [mailto:tmatthewjohn1...@gmail.com]
 Sent: Wednesday, May 11, 2011 9:41 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Bad connection to FS. command aborted

 I did a ant jar after modifying two files in the mapred module. This,
 from
 what I understand, creates a hadoop-*-core.jar in the build folder. Now I
 assume that will be used henceforth for any execution. So how can this be a
 problem if I am running a single-node cluster. Version mismatch with whom ?

 On Wed, May 11, 2011 at 7:07 PM, Habermaas, William 
 william.haberm...@fatwire.com wrote:

  The Hadoop IPCs are version specific.  That is done to prevent an older
  version from talking to a newer one.  Even if nothing has changed in the
  internal protocols the version check is enforced.  Make sure the new
  hadoop-core.jar from your modification is on the classpath used by the
  hadoop shell script.
 
  Bill
 
  -Original Message-
  From: Matthew John [mailto:tmatthewjohn1...@gmail.com]
  Sent: Wednesday, May 11, 2011 9:27 AM
  To: common-user
  Subject: Bad connection to FS. command aborted
 
  Hi all!
 
  I have been trying to figure out why I m getting this error!
 
  All that I did was :
  1) Use a single node cluster
  2) Made some modifications in the core (in some MapRed modules).
  Successfully compiled it
  3) Tried bin/start-dfs.sh alone.
 
  All the required daemons (NN and DN) are up.
  The NameNode and DataNode logs are nt showing any errors/exceptions.
 
  Only interesting thing I found was :
  *WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch
  from 10.72.147.109:40048 got version 94 expected version 3  *
  in the NameNode logs.
 
  Someone please help me out of this!
 
  Matthew
 



Which datanode serves the data for MR

2011-05-09 Thread Matthew John
Hi all,

I wanted to know details such as: in an MR job, which tasktracker
(at the node level) works on data (input split) from which datanode (at the node level)?
Can some logs provide data on this? Or do I need to print this data myself; if yes,
what should I print and how?

Thanks,
Matthew


bin/start-dfs/mapred.sh with input slave file

2011-05-04 Thread Matthew John
Hi all,

I see that there is an option to provide a slaves_file as input to
bin/start-dfs.sh and bin/start-mapred.sh so that slaves are parsed from this
input file rather than the default conf/slaves.

Can someone please help me with the syntax for this. I am not able to figure
this out.

Thanks,
Matthew John


Tweak the Daemon start-up

2011-05-03 Thread Matthew John
Hi all,

Assume I have got (m+n = p) p nodes (excluding the NameNode) in a hadoop
cluster. I wanted to initialize the cluster with TaskTracker alone running
on m nodes and DataNode alone running on the rest n nodes. How can I achieve
such a configuration ? Can I do this by modifying the bin/start-all.sh ?
Suggestions please..

Matthew John


Re: HDFS - MapReduce coupling

2011-05-02 Thread Matthew John
someone kindly give some pointers on this!!

On Mon, May 2, 2011 at 12:46 PM, Matthew John tmatthewjohn1...@gmail.comwrote:

 Any documentation on how the different daemons do reads/writes on HDFS
 and the local file system (directly)? I mean the different protocols used in the
 interactions. I basically wanted to figure out how intricate the coupling
 between the storage (HDFS + local) and the other processes in the Hadoop
 infrastructure is.



 On Mon, May 2, 2011 at 12:26 PM, Ted Dunning tdunn...@maprtech.comwrote:

 Yes.  There is quite a bit of need for the local file system in clustered
 mode.

 For one thing, all of the shuffle intermediate files are on local disk.
 For another, the distributed cache is actually stored on local disk.

 HDFS is a frail vessel that cannot cope with all the needs.

 On Sun, May 1, 2011 at 11:48 PM, Matthew John tmatthewjohn1...@gmail.com
 wrote:

  ...
  2) Does the Hadoop system utilize the local storage directly for any
  purpose
  (without going through the HDFS) in clustered mode?
 
 





Read and Write throughputs via JVM

2011-04-13 Thread Matthew John
Hi all,

I wanted to figure out the read and write throughput that happens in
a map task (read: reading from the input split, write: writing the
map output back) inside a JVM. Do we have any counters that can help
me with this? Or where exactly should I focus on tweaking the code to
add some additional timestamp outputs (for example, a timestamp
at the start and end of the map read)?
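
One low-tech possibility I can think of is custom counters updated from inside the map task, roughly like this (a sketch only; the counter group/names are made up and the I/O types are just placeholders):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Accumulates bytes seen and time spent inside map() into custom counters,
// which show up per task (and aggregated per job) in the JobTracker UI.
public class TimedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long start = System.currentTimeMillis();

    reporter.incrCounter("MapIO", "INPUT_BYTES", line.getLength());
    out.collect(new Text("bytes"), new LongWritable(line.getLength()));

    reporter.incrCounter("MapIO", "MAP_MILLIS", System.currentTimeMillis() - start);
  }
}

If I remember right, the built-in per-task FileSystemCounters (HDFS_BYTES_READ, FILE_BYTES_WRITTEN and so on) already give the byte side of the picture, so maybe only the timing part needs to be added.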

Thanks,
Matthew John


HDFS Compatibility

2011-04-05 Thread Matthew John
Hi all,

Can HDFS run over a raw disk which is mounted at a mount point with
no file system? Or does it interact only with a POSIX-compliant file
system?

thanks,
Matthew


DFSIO benchmark

2011-03-31 Thread Matthew John
Can someone provide pointers/links for DFSIO benchmarks to check the IO
performance of HDFS?

Thanks,
Matthew John


Awareness of Map tasks

2011-03-30 Thread Matthew John
Hi all,

Had some queries on a map task's awareness. From what I understand,
every map task instance is destined to process the data in a specific
input split (which can span HDFS blocks).

1) Do these map tasks have a unique instance number? If yes, are they
mapped to their specific input splits, and what parameters is the mapping
done with (e.g. map task number to input-file byte offset)?
Where exactly is this mapping kept (at what level: jobtracker,
tasktracker, or each task)?

2) Coming to a practical scenario: I run hadoop in local mode and
run a mapreduce job with 10 maps. Since there is an inherent JVM
parallelism (say the node can afford to run 2 map task JVMs
simultaneously), I assume that some map tasks run
concurrently. Since HDFS does not play a role in this case, how is the
map-task-instance to input-split mapping carried out?
Or is there a concept of input split at all (will all the maps start
scanning from the start of the input file)?

Please help me with these queries..

Thanks,
Matthew John


Hadoop code base splits

2011-03-17 Thread Matthew John
Hi,

Can someone provide me some pointers on the following details of
Hadoop code base:

1) breakdown of HDFS code base (approximate lines of code) into
following modules:
 - HDFS at the Datanodes
 - Namenode
 - Zookeeper
 - MapReduce based
 - Any other relevant split

2) breakdown of Hbase code into following modules:
 - HMaster
 - RegionServers
 - MapReduce
 - Any other relevant split

Matthew John


Iostat on Hadoop

2011-03-16 Thread Matthew John
Hi all,

Can someone give pointers on using Iostat to account for IO overheads
(disk read/writes) in a MapReduce job.

Matthew John


Re: hadoop installation problem(single-node)

2011-03-02 Thread Matthew John
hey Manish,

Are you giving the commands in the HADOOP_HOME directory? If yes, please
give bin/hadoop namenode -format. Don't forget to prepend bin/ to
your commands, because all the scripts reside in the bin directory.

Matthew

On Wed, Mar 2, 2011 at 2:29 PM, Manish Yadav manish.ya...@orkash.com wrote:
 Dear Sir/Madam
  I'm very new to hadoop. I'm trying to install hadoop on my computer. I
 followed a weblink and tried to install it. I want to install hadoop on a
 single-node cluster.
 I'm using Ubuntu 10.04 64-bit as my operating system. I have installed
 java in /usr/java/jdk1.6.0_24. The steps I took to install hadoop are the
 following:

 1: Made a group hadoop and a user hadoop with a home directory.
 In the hadoop directory I have a directory called projects; I downloaded the
 hadoop binary there and extracted it there.
 I configured ssh as well.

 Then I made changes to some files, which are the following. I'm attaching them
 with this mail, please check them:
 1: hadoop_env_sh
 2: core-site.xml
 3: mapreduce-site.xml
 4: hdfs-site.xml
 5: hadoop user's .bashrc
 6: hadoop user's .profile

  After making changes to these files, I just entered the hadoop account and
 entered a few commands; the following happened:

 hadoop@ws40-man-lin:~$ echo $HADOOP_HOME
 /home/hadoop/project/hadoop-0.20.0
 hadoop@ws40-man-lin:~$ hadoop namenode -format
 hadoop: command not found
 hadoop@ws40-man-lin:~$ namenode -format
 namenode: command not found
 hadoop@ws40-man-lin:~$

 Now I'm completely stuck and I don't know what to do. Please help me, as there is
 no more help around the net.
 I'm attaching the files which I changed; can you tell me the exact
 configuration which I should use to install hadoop?





Re: hadoop installation problem(single-node)

2011-03-02 Thread Matthew John
Hey Manish,

I am not very sure if you have got your configuration correct,
including the Java path. Can you try re-installing hadoop following the
guidelines given in the following link, step by step? That would take
care of any possible glitches.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Thanks,
Matthew

On Thu, Mar 3, 2011 at 10:42 AM, manish.yadav manish.ya...@orkash.com wrote:

 Thanks for the help, the command is working now, but I got the following
 errors. Will you help me solve these errors?
 I'm giving you the error list which I faced installing hadoop on a single-node
 cluster; all the configuration files are attached to the earlier post.
 I just used the command
 hadoop@ws40-man-lin:~/project/hadoop-0.20.0$ bin/hadoop namenode -format
 and I get the following result:
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/hadoop/hdfs/server/namenode/NameNode
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.server.namenode.NameNode
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
 Could not find the main class:
 org.apache.hadoop.hdfs.server.namenode.NameNode.  Program will exit.
 Now, what am I doing wrong?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/hadoop-installation-problem-single-node-tp2613742p2623014.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: Cost of bytecode execution in MapReduce

2011-02-17 Thread Matthew John
Hi Ted,

Can you provide a link to the same? I am not able to find it :(.


On Thu, Feb 17, 2011 at 9:54 PM, Ted Yu yuzhih...@gmail.com wrote:
 There was a discussion thread about why hadoop was developed in Java.
 Please read it.

 On Wed, Feb 16, 2011 at 10:39 PM, Matthew John
 tmatthewjohn1...@gmail.comwrote:

 hi Ted,
 wanted to know if its development environment specific. Can u throw
 some light on whether there is any inherent bytecode ececution cost ?
 I am not using any specific development environment now (like Eclipse)
 .

 Matthew

 On Thu, Feb 17, 2011 at 11:52 AM, Ted Yu yuzhih...@gmail.com wrote:
  Is your target development environment using C++ ?
 
  On Wed, Feb 16, 2011 at 9:49 PM, Matthew John 
 tmatthewjohn1...@gmail.comwrote:
 
  Hi all,
 
  I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs
  any fixed cost of ByteCode execution. And how do the mappers (say of
  WordCount MR) look like in detail (in bytecode detail) ?? Any good
  pointers to this ?
 
  Thanks,
  Matthew John
 
 




Re: Cost of bytecode execution in MapReduce

2011-02-17 Thread Matthew John
Hi Ted,

I basically want to analyse the cost functions of the MapReduce
framework in Hadoop. That would include a good understanding of the
bytecode execution costs that come with Mappers and Reducers. I
know it might change for different MRs, so I am thinking of taking
WordCount and analysing it in depth. The intention is to
optimize/tweak MapReduce for different workloads and commodity
resources.

It would be great if someone could provide some links to work already
done, or help me with some framework which enables bytecode-level
analysis (I guess Eclipse could be a good option, but I have never tried
it).

Thanks,
Matthew John

On Fri, Feb 18, 2011 at 2:54 AM, Ted Yu yuzhih...@gmail.com wrote:
 Are you investigating alternative map-reduce framework ?

 Please read:
 http://www.craighenderson.co.uk/mapreduce/

 On Thu, Feb 17, 2011 at 9:45 AM, Matthew John 
 tmatthewjohn1...@gmail.comwrote:

 Hi Ted,

 Can u provide a link to the same ? Not able to find it :( .


 On Thu, Feb 17, 2011 at 9:54 PM, Ted Yu yuzhih...@gmail.com wrote:
  There was a discussion thread about why hadoop was developed in Java.
  Please read it.
 
  On Wed, Feb 16, 2011 at 10:39 PM, Matthew John
  tmatthewjohn1...@gmail.comwrote:
 
  hi Ted,
  wanted to know if its development environment specific. Can u throw
  some light on whether there is any inherent bytecode ececution cost ?
  I am not using any specific development environment now (like Eclipse)
  .
 
  Matthew
 
  On Thu, Feb 17, 2011 at 11:52 AM, Ted Yu yuzhih...@gmail.com wrote:
   Is your target development environment using C++ ?
  
   On Wed, Feb 16, 2011 at 9:49 PM, Matthew John 
  tmatthewjohn1...@gmail.comwrote:
  
   Hi all,
  
   I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs
   any fixed cost of ByteCode execution. And how do the mappers (say of
   WordCount MR) look like in detail (in bytecode detail) ?? Any good
   pointers to this ?
  
   Thanks,
   Matthew John
  
  
 
 




Cost of bytecode execution in MapReduce

2011-02-16 Thread Matthew John
Hi all,

I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs
any fixed cost of bytecode execution. And what do the mappers (say, of
the WordCount MR) look like in detail (at the bytecode level)? Any good
pointers on this?

Thanks,
Matthew John


Re: Cost of bytecode execution in MapReduce

2011-02-16 Thread Matthew John
Hi Ted,
I wanted to know if it is development-environment specific. Can you throw
some light on whether there is any inherent bytecode execution cost?
I am not using any specific development environment now (like Eclipse).

Matthew

On Thu, Feb 17, 2011 at 11:52 AM, Ted Yu yuzhih...@gmail.com wrote:
 Is your target development environment using C++ ?

 On Wed, Feb 16, 2011 at 9:49 PM, Matthew John 
 tmatthewjohn1...@gmail.comwrote:

 Hi all,

 I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs
 any fixed cost of ByteCode execution. And how do the mappers (say of
 WordCount MR) look like in detail (in bytecode detail) ?? Any good
 pointers to this ?

 Thanks,
 Matthew John




Mechanism of MapReduce in Hadoop

2011-02-16 Thread Matthew John
Hi all,

I want to know if anyone had already done an in-depth analysis of the
MapReduce mechanism. Has anyone really gone into bytecode level
understanding of the Map and Reduce mechanism. It would be good if we
can take a simple MapReduce (say WordCount) and then try the analysis.
Please send me pointers if there is already some work done in this
respect. Or please help me with how to proceed with the same analysis
if you feel a specific technique/software/development environment has
ready plugins to help in this regard.

thanks,
Matthew John


Hbase documentations

2011-02-14 Thread Matthew John
Hi guys,

Can someone send me good documentation on HBase (other than the
hadoop wiki)? I am also looking for a good HBase tutorial.

Regards,
Matthew


Re: Could I write outputs in multiple directories?

2011-02-13 Thread Matthew John
Hi Junyoung Kim,

You can try out MultipleOutputs.addNamedOutput(). The second
parameter you pass in is supposed to be the name of the file to which you are
writing the reducer output. Therefore, if your output folder is X
(set using setOutputPath()), you can try giving A/output, B/output,
C/output as the second parameter. It should write the
corresponding data to X/A/output, X/B/output and X/C/output
respectively, I guess.

In the reducer, depending on the key, you can use getCollector() to
write to the different output paths.
For example:
if (key.toString().equals("A"))
    multipleOutputs.getCollector("A/output", reporter).collect(key, value);
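
A slightly fuller, untested sketch of the shape I mean, with the old mapred API (class names are made up; I have used plain named outputs like "A" here because, if I remember right, the named-output string may be restricted to alphanumerics, so the path-style names above are worth double-checking):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver side (JobConf job), one named output per key type:
//   MultipleOutputs.addNamedOutput(job, "A", TextOutputFormat.class, Text.class, Text.class);
//   MultipleOutputs.addNamedOutput(job, "B", TextOutputFormat.class, Text.class, Text.class);
public class RouteReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Routing logic is up to you; here keys starting with "A" go to named output A.
    String named = key.toString().startsWith("A") ? "A" : "B";
    while (values.hasNext()) {
      mos.getCollector(named, reporter).collect(key, values.next());
    }
  }

  public void close() throws IOException {
    mos.close();
  }
}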

Regards,
Matthew

On Mon, Feb 14, 2011 at 11:27 AM, Jun Young Kim juneng...@gmail.com wrote:

 Hi,

 As I understand it, Hadoop can write multiple files in a directory,
 but it can't write output files into multiple directories. Is that right?


 MultipleOutputs for generating multiple files.
 FileInputFormat.addInputPaths for setting several input files simultaneously.

 How could I do it if I want to write output files into multiple directories,
 depending on the key?

 for example)
 A type key - MMdd/A/output
 B type Key - MMdd/B/output
 C type Key - MMdd/C/output

 thanks.

 --
 Junyoung Kim (juneng...@gmail.com)



some doubts Hadoop MR

2011-02-10 Thread Matthew John
Hi all,

I had some doubts regarding the functioning of Hadoop MapReduce :

1) I understand that every MapReduce job is parameterized using an XML file
(with all the job configurations). So whenever I set certain parameters
in my MR code (say I set the split size to 32 KB), it does get reflected
in the job (number of mappers). How exactly does that happen? Do the
parameters coded in the MR module override the default parameters set in the
configuration XML? And how does the JobTracker ensure that the
configuration is followed by all the TaskTrackers? What is the mechanism
followed?

2) Assume I am running cascading (chained) MR modules. In this case I feel
there is a huge overhead when the output of MR1 is written back to HDFS and then
read from there as the input of MR2. Can this be avoided (maybe by storing it in
memory without hitting HDFS and the NameNode)? Please let me know if
there is some means of doing this, because it would increase the
efficiency of chained MR to a great extent.

Matthew


Strange byte [] size conflict

2011-02-02 Thread Matthew John
Hi all,

I have a BytesWritable key that comes to the mapper.

If I give key.getLength(), it returns 32.

Then I tried creating a new byte[] array, initializing its size to 32 (byte[]
keybytes = new byte[32];),

and I tried giving: keybytes = key.getBytes();

Now keybytes.length (which should return 32) is returning 48!

I don't understand why this is happening! Please help me with this!
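
For what it is worth, the pattern I have seen suggested is to copy only the valid bytes, something like this (a sketch; it assumes getBytes() returns the padded backing buffer while getLength() marks the end of the real data):

import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;

public class BytesWritableUtil {
  // getBytes() returns the internal backing buffer, which is usually padded;
  // only the first getLength() bytes are valid data.
  public static byte[] validBytes(BytesWritable w) {
    return Arrays.copyOf(w.getBytes(), w.getLength());
  }
}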

Thanks,
Matthew


Map-Reduce-Reduce

2011-01-25 Thread Matthew John
Hi all,


I was working on a MapReduce program which does BytesWritable
dataprocessing. But currently I am basically running two MapReduces
consecutively to get the final output :

Input --(MapReduce1)--> Intermediate --(MapReduce2)--> Output

Here I am running MapReduce2 only to sort the intermediate data on the basis
of a Key comparator logic.

I wanted to cut the number of MapReduces down to just one. I have figured
out a logic to do the same. But the only problem is that in my logic I need
to run a sort on the Reduce output to get the final output. The flow looks
like this:

Input --(MapReduce1)--> Output (not sorted)

I want to know if it is possible to attach one more Reduce module to the
dataflow so that it can perform the inherent sort before the 2nd reduce
call. It would look like:

Input --(Map)--> MapOutput --(Reduce1)--> Output (not sorted) --(Reduce2,
for which Reduce1 acts as a Mapper)--> Output

Please let me know  if  there can be some means of sorting the output
without invoking a separate MapReduce just for the sake of sorting it .

Thanks ,
Matthew


Re: help for using mapreduce to run different code?

2010-12-28 Thread Matthew John
Hi Jander,

If I understand what you want, you would like to run the map instances of two
different MapReduce jobs (so obviously different mapper code) simultaneously on
the same machine. If I am correct, it has more to do with the setting for the
number of simultaneous mapper instances (I guess its default is 2 or 4). And
there should be a way to divide the map instances between the two MR jobs
you want to run together (to fill up the slot of 4). Please correct me if I am
wrong; I just wanted to try clearing the air regarding the query :).

Matthew

On Wed, Dec 29, 2010 at 5:47 AM, maha m...@umail.ucsb.edu wrote:

 Hi Jander,

   You mean write Map in another language?  like python or C, then yes.
 Check this http://hadoop.apache.org/common/docs/r0.18.0/streaming.html for
 Hadoop Streaming.

 Maha

 On Dec 28, 2010, at 2:53 PM, Jander g wrote:

  Hi, all
 
  Does Hadoop support the map function running different code? If yes, how
  can I realize this?
 
  Thanks in advance!
 
  --
  Regards,
  Jander




hdfs with raid

2010-12-22 Thread Matthew John
Hi all,

Got to know about a hdfs with raid implementation from the following
documentation :
http://wiki.apache.org/hadoop/HDFS-RAID

In the documentation, it says you can find the hadoop-*-raid.jar file, which
has the libraries needed to run RAID over HDFS.
Where can I get this file? I searched a lot, but could not get my hands on
it.

Thanks,
Matthew


Re: InputFormat for a big file

2010-12-17 Thread Matthew John
//So can you guide me to write a InputFormat which splits the file
//into multiple Splits
The more mappers you assign, the more input splits in the
mapreduce;
in effect, the number of input splits equals the number of mappers
assigned (the value is used as a hint when the splits are computed).

That should take care of the problem, I guess.
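
A small driver sketch of the knob I mean (old mapred API, hypothetical class name; everything except the input/output paths and the map-count hint is left at its defaults):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitHintDriver {
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(SplitHintDriver.class);
    job.setJobName("split-hint-demo");

    job.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // This is only a hint, but FileInputFormat uses it when sizing splits,
    // so a 1.4 GB file asked to yield ~64 maps gets roughly 22 MB per split
    // instead of one line per map() call.
    job.setNumMapTasks(64);

    JobClient.runJob(job);
  }
}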

Matthew




On Fri, Dec 17, 2010 at 9:28 PM, madhu phatak phatak@gmail.com wrote:

 Hi
 I have a very large file of size 1.4 GB. Each line of the file is a number.
 I want to find the sum of all those numbers.
 I wanted to use NLineInputFormat as the InputFormat, but it sends only one
 line to the Mapper, which is very inefficient.
 So can you guide me in writing an InputFormat which splits the file
 into multiple splits, so that each mapper can read multiple
 lines from its split?

 Regards
 Madhukar



Hadoop 0.20.2 with eclipse in windows

2010-12-13 Thread Matthew John
Hi all,

 I have been working with Hadoop0.20.2 in linux nodes. Now I want to try the
same version with eclipse on a windows xp machine. Could someone provide a
tutorial/guidelines on how to install this setup.

thanks,
Matthew


Re: Hadoop 0.20.2 with eclipse in windows

2010-12-13 Thread Matthew John
I tried installing using this link, but as in the tutorial when I try to run
bin/hadoop namenode -format
it gives the following error :

bin/hadoop: line 2 : $'\r' : command not found
and many such statements..

I've given the local JDK folder as JAVA_HOME.
Not sure why this is showing up. I've not used Cygwin till now.

Matthew

On Tue, Dec 14, 2010 at 9:38 AM, Harsh J qwertyman...@gmail.com wrote:

 Hi,

 On Tue, Dec 14, 2010 at 9:22 AM, Matthew John
 tmatthewjohn1...@gmail.com wrote:
  Hi all,
 
   I have been working with Hadoop0.20.2 in linux nodes. Now I want to try
 the
  same version with eclipse on a windows xp machine. Could someone provide
 a
  tutorial/guidelines on how to install this setup.

 This page's instruction still works for running a Hadoop cluster on
 Windows + the Plugin w/ Cygwin:
 http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html

 
  thanks,
  Matthew
 



 --
 Harsh J
 www.harshj.com



Hadoop Certification Progamme

2010-12-08 Thread Matthew John
Hi all,.

Is there any valid Hadoop Certification available ? Something which adds
credibility to your Hadoop expertise.

Matthew


Tweaking the File write in HDFS

2010-11-14 Thread Matthew John
Hi all ,

I have been working with MapReduce and HDFS for some time. The procedure I
normally follow is:

1) copy in the input file from Local File System to HDFS

2) run the map reduce module

3) copy the output file back to the Local File System from the HDFS

But I feel steps 1 and 3 add a lot of overhead to the entire process!

My queries are :

1) I am getting the files onto the local file system by establishing a port
connection with another node. So can I ensure that the data which is ported
to the hadoop node is written directly to HDFS, instead of going
through the local file system and then performing a copyFromLocal? (A rough
sketch of what I mean follows after query 2.)

2) Can I copy the reduce output (which creates the final output file)
directly to the local file system instead of injecting it into HDFS
(effectively onto different nodes in HDFS), so that I can minimize the
overhead? I expect this to take much less time than copying to
HDFS and then performing a copyToLocal. Finally, I should be able to
send this file back to another node using socket communication.
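
On query 1, this is the kind of thing I mean (untested sketch; host, port and path are placeholders), streaming the socket straight into an HDFS file with the FileSystem API instead of landing it on local disk first:

import java.io.InputStream;
import java.net.Socket;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SocketToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / fs.default.name
    FileSystem fs = FileSystem.get(conf);

    Socket socket = new Socket("sender-host", 12345);  // placeholder host and port
    InputStream in = socket.getInputStream();
    FSDataOutputStream out = fs.create(new Path("/user/matthew/input/ported_file"));

    try {
      IOUtils.copyBytes(in, out, conf, false);         // stream bytes directly into HDFS
    } finally {
      IOUtils.closeStream(out);
      IOUtils.closeStream(in);
      socket.close();
    }
  }
}

Query 2 could presumably be the mirror image (fs.open() on the part file copied to a local FileOutputStream), though that still reads the data out of HDFS once.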

Looking forward to your suggestions !!

Thanks,

Matthew John


Multiple Input

2010-10-20 Thread Matthew John
Hi all,

I modified a MapReduce code which had only a single Input path to accomodate
Multiple Inputs..

The changes I made (in Driver file) :

Path FpdbInputPath = new Path(args[0]);
Path ClogInputPath = new Path(args[1]);
FpdbInputPath =
FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job));
ClogInputPath =
ClogInputPath.makeQualified(ClogInputPath.getFileSystem(job));
MultipleInputs.addInputPath(job, FpdbInputPath, Dup1InputFormat.class,
Dup1FpdbMapper.class);
MultipleInputs.addInputPath(job, ClogInputPath, Dup1InputFormat.class,
Dup1ClogMapper.class);

But when I run the program it is giving the exception :
java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
and so on.

Is it because there is a default input directory set and when it finds there
is nothing there it gives out the error ??

Please help me out of this..

Thanks ,

Matthew


Multiple input not working

2010-10-20 Thread Matthew John
Hi all,

I modified a MapReduce code which had only a single Input path to accomodate
Multiple Inputs..

The changes I made (in Driver file) :

Path FpdbInputPath = new Path(args[0]);
Path ClogInputPath = new Path(args[1]);
FpdbInputPath =
FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job));
ClogInputPath =
ClogInputPath.makeQualified(ClogInputPath.getFileSystem(job));
MultipleInputs.addInputPath(job, FpdbInputPath, Dup1InputFormat.class,
Dup1FpdbMapper.class);
MultipleInputs.addInputPath(job, ClogInputPath, Dup1InputFormat.class,
Dup1ClogMapper.class);

But when I run the program it is giving the exception :
java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
and so on.

Is it because there is a default input directory set and when it finds there
is nothing there it gives out the error ??

Please help me out of this..

Note : I am using Hadoop - 0.20.2 .. Is it by anyways because of the version
that MultipleInputs is not updating the map.input.dir with the paths added ?

Thanks ,

Matthew


Reduce groups

2010-10-19 Thread Matthew John
Hi all,

The number of reducer groups in my MapReduce is always the same as the
number of records output by the MapReduce. So what I understand is that every
record from the shuffle/sort is going to a different Reducer.reduce() call. How can I
change this? My key is BytesWritable, and I tried writing my own comparator
and setting it with setOutputValueGroupingClass, but still no more than one record
is entering the same reduce group. Can someone please tell me the mechanism
behind this so that I can fix the problem? I am not worrying about the
Partitioner since I am using a single reducer.
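
For reference, the bare-bones shape of grouping comparator I mean is below (a sketch only; the 8-byte prefix is made up, keys are assumed to be at least 8 bytes, and as far as I understand the sort comparator still has to order matching keys next to each other for the grouping to take effect):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups keys by their first 8 bytes only, so all records sharing that
// prefix arrive in a single reduce() call.
public class PrefixGroupingComparator extends WritableComparator {

  public PrefixGroupingComparator() {
    super(BytesWritable.class, true);  // true = deserialize keys before comparing
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    BytesWritable x = (BytesWritable) a;
    BytesWritable y = (BytesWritable) b;
    return compareBytes(x.getBytes(), 0, 8, y.getBytes(), 0, 8);
  }
}

It would be wired in with conf.setOutputValueGroupingClass(PrefixGroupingComparator.class).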

Thanks,

Matthew


Multiple Input Data Processing using MapReduce

2010-10-14 Thread Matthew John
Hi all ,

 I have recently been working on a task where I need to take in two input
(types of) files, compare them, and produce a result from them using some logic.
But as I understand it, simple MapReduce implementations are for processing a
single input type. The closest implementation I could think of similar to my
work is a join MapReduce. But I am not able to understand much from the
example provided in Hadoop. Can someone provide a good pointer to such
multiple-input data processing (or joins) in MapReduce? It would also be
great if you could send some sample code for the same.

Thanks ,

Matthew


doubts

2010-10-13 Thread Matthew John
Hi all ,

Had some doubts :

1) What happens when a mapper running on node A needs data from a block it
does not have? (The block might be present on some other node in the
cluster.)

2) The sort/shuffle phase is just a logical representation of all the map
outputs together, sorted, right? And again, what happens when a reduce on node C
needs access to some map outputs not in its memory?

Matthew .


Re: Easy Question

2010-10-05 Thread Matthew John
hi Maha,

  try the following:

go to your dfs.data.dir/current

You will find a file VERSION. Just replace the namespaceID in it with the
namespace ID found in the log (in this previous post: 200395975), then restart
hadoop
(bin/start-all.sh)...

see if all the daemons are up.


regards,
Matthew


changing SequenceFile format

2010-09-13 Thread Matthew John
Hi guys,

I wanted to take in a file with the layout <key1><value1><key2><value2>... (a
binary sequence file where the key and value lengths are constant) as input for the
Sort (in examples). But as I understand it, the data in a standard SequenceFile in
hadoop is in the format <record length><key length><key><value>. Where
should I modify the code so as to use my input file as input to the
record reader?

Please pour in your views ..

Matthew


Re: changing SequenceFile format

2010-09-13 Thread Matthew John
When it comes to the Writer, I can see the append and appendRaw methods. But the
next methods (many of them!) in the Reader are confusing.

Can you give further info on them?

Matthew


Error: Java heap space

2010-09-09 Thread Matthew John
Hi all,

  I tried to run a customised sort with the following details:

* I have a metafile to be sorted. So, on a testing basis, I created a
SequenceFile version of the metafile by prepending a SequenceFile-generated
header to the record part of the metafile (I kept it in the same sequence: record
length, key length, key, value).
* I also implemented the writables for the key and value in my records.
* I also implemented the input/output formats for my records (not sure
whether they are correct).
* I tried running this customized Sort with the new parameters and
input file. I also set the number of maps and reduces both to 1.

I am getting the following error: *Task Id :
attempt_201009082009_0006_m_00_0, Status : FAILED*
*Error: Java heap space*

Someone please throw some light on this...

thanks,
Matthew John


Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
Thanks Jeff!
Following what you have said, I built my hadoop core jar first (command:
ant jar). That created a hadoop-core.jar in the build folder. Now can you please
tell me how to use this as a dependency when building examples.jar?
Because if I run 'ant examples', it gives errors saying the new classes I've
included in the core are not found. I suppose that's because it's using the
old hadoop-core.jar.

Thanks,

Matthew John


Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
Hey Jeff ,

I gave the command :

 bin/hadoop jar hadoop-0.20.2-examples.jar sort -libjars
./build/hadoop-0.20.3-dev-core.jar -inFormat
org.apache.hadoop.mapred.MetafileInputFormat -outFormat
org.apache.hadoop.mapred.MetafileOutputFormat -outKey
org.apache.hadoop.io.FpMetaId -outValue org.apache.hadoop.io.FpMetadata
fp_input fp_output

where hadoop-0.20.3-dev-core.jar is the new core jar (built with 'ant jar'),
whereas hadoop-0.20.2-examples.jar is still the same old examples jar file
(I could not make the new examples jar using 'ant examples', since it doesn't
have the latest dependencies on the new classes I have defined). The other
parameters are the new classes I want to use for running Sort.
I feel I should make the new examples jar but don't know how to :( :(
Please tell me how to give the new core jar as a dependency when running
'ant examples'.

I am getting the following errors when i ran the command.. :

java.lang.ClassNotFoundException:
org.apache.hadoop.mapred.MetafileInputFormat.. and so on ...

Thanks,

Matthew


Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
  <target name="examples" depends="jar, compile-examples" description="Make the Hadoop examples jar.">
    <jar jarfile="${build.dir}/${final.name}-examples.jar"
         basedir="${build.examples}">
      <manifest>
        <attribute name="Main-Class"
                   value="org/apache/hadoop/examples/ExampleDriver"/>
      </manifest>
    </jar>
  </target>

This is the part of build.xml which creates the examples jar. From what I
understand, it depends on the 'jar' target, which builds hadoop-core.jar. I
have a feeling it is still depending on the older version of core.jar, and so
is not able to find the classes which are not present in that older version
of core.jar. Therefore it gives ClassNotFound.

I want to make a new examples jar which depends on the new core.jar. Please
guide me on that, and let me know if my understanding is wrong.

Thanks,

Matthew John


Re: How to rebuild Hadoop ??

2010-09-08 Thread Matthew John
Hey Guys ! ,

Finally my examples.jar got built :) :) It was just a small error - I didn't
declare the package for some of the newly written files :P

Now i will run the command :

bin/hadoop jar hadoop-0.20.2-examples.jar (the new one) sort -inFormat
org.apache.hadoop.mapred.MetafileInputFormat -outFormat
org.apache.hadoop.mapred.MetafileOutputFormat -outKey
org.apache.hadoop.io.FpMetaId -outValue org.apache.hadoop.io.FpMetadata
fp_input fp_output

and see what happens !!

Thanks a lot for your time..

Matthew


Re: Sort with customized input/output !!

2010-09-08 Thread Matthew John
Thanks for the reply Ted !!

What I understand is that a SequenceFile will have a header followed by the
records in a format : Recordlength,Keylength,Key,Value with a sync marker
coming at some regular interval..

It would be great if someone can take a look at the following..

Q 1) The thing is, my file is basically in the format: header (a different
one) followed by records (key, value). In this case the sizes of the record and key
are fixed. I would like to know *if I can modify the core code to make the
SequenceFile format like this*. If yes, what code should I look at?

Q 2) *What is a sync marker (can we define it)?* Obviously my file would
not have one. Can someone suggest a way to get around this obstacle?
My final aim is to take this file in, sort it with respect to the key, and print
the sorted file.

Thanks,
Matthew


Re: SequenceFile Header

2010-09-08 Thread Matthew John
Hi Edward ,

Thanks for your reply.

My aim is not to generate a SequenceFile. It is to take a file (of a certain
format) and sort it. So I guess I should create an input SequenceFile from
the original file and feed it to the Sort as input. The output will
again be in SequenceFile format, and I will have to convert it back to my
original file format.

So right now I am more concerned about step 1 (conversion of the original file
to an input sequence file) and step 3 (conversion of the output sequence file
back to the original file format). It would be great if you can suggest some
ways of doing that. Also, please correct me if my approach is wrong.
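
For step 1, this is roughly the converter I have in mind (an untested sketch; it assumes the raw metafile is just fixed-width 8-byte keys followed by 32-byte values, with the paths passed on the command line):

import java.io.EOFException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

// Wraps raw fixed-width records (8-byte key + 32-byte value) into a normal
// SequenceFile that the example Sort can consume as-is.
public class RawToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    FSDataInputStream in = fs.open(new Path(args[0]));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), BytesWritable.class, BytesWritable.class);

    byte[] k = new byte[8];
    byte[] v = new byte[32];
    try {
      while (true) {
        in.readFully(k);
        in.readFully(v);
        writer.append(new BytesWritable(k), new BytesWritable(v));
      }
    } catch (EOFException eof) {
      // reached the end of the raw metafile
    } finally {
      writer.close();
      in.close();
    }
  }
}

Step 3 would presumably be the mirror image, using SequenceFile.Reader and next(key, value) to write the raw bytes back out.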

Thanks,

Matthew


Sort with customized input/output !!

2010-09-07 Thread Matthew John
Hey ,
M pretty new to Hadoop .

I need to Sort a Metafile (TBs) and thought of using Hadoop Sort (in
examples) for it.
My input metafile looks like this -- binary stream (only 1's and 0's). It
basically contains records of 40 bytes.
Every record goes like this:

long a;                  /* the key -- 8 bytes; the rest of the structure is the value -- 32 bytes */
long b;
int c;
int d;
int e;
int unprocessed;
int compress_attempted;
int gatherer;


I have created a *FpMetaId.java (extends BytesWritable)* corresponding to
the key and *FpMetadata.java (extends BytesWritable)* corresponding to
the value.

My sole aim is to get these records (40 bytes) sorted with the fp (double)
as the key. And I need to write these sorted records back into a metafile
(exactly my old metafile but with sorted records binaries only).
I also implemented ::

*MetafileInputFormat.java (extends SequenceFileAsBinaryInputFormat)* ---
a file making an input format compatible with my record.
*MetafileOutputFormat<K, V> (extends SequenceFileOutputFormat)* --- a file
making the output format compatible with my record.
*MetafileRecordReader.java (extends
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader)* ---
a file implementing a record reader compatible with my record.

MetafileRecordWriter class has been implemented with in my
MetafileOutputFormat.java file.

Let me kindly get you through the sequence of events which followed :

1) I resolved all the errors in the writable classes (FpMetaId, FpMetadata)
and in/out formats (MetafileInputFormat, MetafileOutputFormat,) and
RecordReaders I implemented.

2) Writables I copied to /io folder. Other new files were copied to /mapred
folder. I successfully built it.

3) I modified the Sort file (the function I want to run with FpMetaId as key
and FpMetadata as value and imported these new classes in the file.) I
changed default conf settings to these required Writables and
RecordReaders.. I built hadoop using ant command after this. It successfully
got built.

*Q) Does this ensure that all the new changes are reflected in the jar (am
I ready to go and execute the sort function?)*

4) As I already mentioned, I am working with a sequential file
format (binary) with a (key, value) data structure repeating. So I wrote a C
program which generates random values for my data structure and populated a
file, sequentially writing my (key, value) data structure in binary. I gave this as
my input for the sort, which should sort my (key, value)s with respect to the
keys. I got the error: fp_input not a SequenceFile (fp_input is my input
file). I thought SequenceFiles would just be a stream of binary data. Do they
have any specific format?

*Command used :  bin/hadoop jar hadoop-0.20.2-examples.jar sort fp_input
fp_output*

*Q) What does this imply ? I have no clue how to proceed further. Again, is
it because my jar file used to execute doesnt have the latest libraries ? I
could not get any good tutorials on this.
*

It would be great if someone can offer an helping hand to this noob.

Thanks,
Matthew John


How to rebuild Hadoop ??

2010-09-07 Thread Matthew John
Hi all,


 I wrote some new writable files corresponding to my data input. I added
them to /src/org//io/ where all the writables reside. Similarly, I
also wrote input/output format files and a recordreader and added them to
src/mapred/./mapred/ where all related files reside.

 I want to run the Sort function (in examples) with these new classes
(writables, record reader, I/O formats). So I also modified the Sort to
incorporate these files and imported them in Sort.java. After all
this, I ran 'ant clean' and then 'ant' to build everything fresh. But
nothing really happened, I guess, because when I run the program it gives a
ClassNotFoundException for the classes I pass as parameters on the command line.

  Someone please help me out! How do I modify the core files (incorporate
more core io/mapred files) in Hadoop?

Thanks,

Matthew John


Re: How to rebuild Hadoop ??

2010-09-07 Thread Matthew John
Thanks a lot Jeff !

The problem is that every time I build (using ant) a build folder is
created, but there is no examples.jar created inside it. I wanted to add
some files to the io package and the mapred package, so I suppose I should put the
files appropriately (inside the io and mapred folders respectively). I want to
run the Sort in examples.jar using these added classes. I guess I can import
these new files in the Sort code and build the entire thing again.

But I am not able to figure out how to rebuild the core jar and the
examples jar with the modified Sort.