Re: Hadoop Profiling!

2008-10-08 Thread Ashish Venugopal
Are you interested in simply profiling your own code (in which case you can
clearly use whatever Java profiler you want), or your construction of the
MapReduce job, i.e. how much time is being spent in the map vs. the sort vs.
the shuffle vs. the reduce? I am not aware of a good solution to the second
problem; can anyone comment?

Ashish

On Wed, Oct 8, 2008 at 12:06 PM, Stefan Groschupf [EMAIL PROTECTED] wrote:

 Just run your map reduce job locally and connect your profiler. I use
 YourKit. Works great!
 You can profile your map reduce job by running it in local mode, just like
 any other Java app.
 However, we have also profiled on a grid. You just need to install the
 YourKit agent into the JVM of the node you want to profile, and then you
 connect to the node while the job runs.
 However, you need to time things well, since the task JVM is shut down as
 soon as your job is done.
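
 For example (untested, and the exact agent path and options depend on your
 YourKit install), one way to push the agent into the task JVMs for a single
 job is via mapred.child.java.opts, e.g. on a streaming invocation:

   -jobconf mapred.child.java.opts="-agentpath:/path/to/yourkit/libyjpagent.so=port=10001"

 and then attach the YourKit UI to that port on the node while the task runs.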
 Stefan

 ~~~
 101tec Inc., Menlo Park, California
 web:  http://www.101tec.com
 blog: http://www.find23.net




 On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote:

  Hi!

 I've developed a Map/Reduce algorithm to analyze some logs from a web
 application.

 So basically, we are ready to start the QA test phase, and now I would like
 to know how efficient my application is from a performance point of view.

 Is there any procedure I could use to do some profiling?


 Basically I need basic data, like execution time or code bottlenecks.


 Thanks in advance.

 -- Gerardo Velez





Re: Hadoop Profiling!

2008-10-08 Thread Ashish Venugopal
Great, thanks for this info. Is there any chance that this information can
also be exposed for streaming jobs?
(All of the jobs that we run in our lab run only via streaming...)

Thanks!

Ashish

On Wed, Oct 8, 2008 at 12:30 PM, George Porter [EMAIL PROTECTED] wrote:

 Hi Ashish,

 I believe that Ari committed two instrumentation classes,
 TaskTrackerInstrumentation and JobTrackerInstrumentation (both in
 src/mapred/org/apache/hadoop/mapred), that can give you information on when
 components of your M/R jobs start and stop.  I'm in the process of writing
 some additional instrumentation APIs that collect timing information about
 the RPC and HDFS layers, and will hopefully be able to submit a patch in a
 few weeks.

 Thanks,
 George


 Ashish Venugopal wrote:

 Are you interested in simply profiling your own code (in which case you can
 clearly use whatever Java profiler you want), or your construction of the
 MapReduce job, i.e. how much time is being spent in the map vs. the sort vs.
 the shuffle vs. the reduce? I am not aware of a good solution to the second
 problem; can anyone comment?

 Ashish

 On Wed, Oct 8, 2008 at 12:06 PM, Stefan Groschupf [EMAIL PROTECTED] wrote:



 Just run your map reduce job locally and connect your profiler. I use
 YourKit. Works great!
 You can profile your map reduce job by running it in local mode, just like
 any other Java app.
 However, we have also profiled on a grid. You just need to install the
 YourKit agent into the JVM of the node you want to profile, and then you
 connect to the node while the job runs.
 However, you need to time things well, since the task JVM is shut down as
 soon as your job is done.
 Stefan

 ~~~
 101tec Inc., Menlo Park, California
 web:  http://www.101tec.com
 blog: http://www.find23.net




 On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote:

  Hi!


 I've developed a Map/Reduce algorithm to analyze some logs from a web
 application.

 So basically, we are ready to start the QA test phase, and now I would like
 to know how efficient my application is from a performance point of view.

 Is there any procedure I could use to do some profiling?


 Basically I need basic data, like execution time or code bottlenecks.


 Thanks in advance.

 -- Gerardo Velez









 --
 George Porter, Sun Labs/CTO
 Sun Microsystems - San Diego, Calif.
 [EMAIL PROTECTED] 1.858.526.9328




Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-08-28 Thread Ashish Venugopal
It's slightly counterintuitive, but I used to get errors like this when my
reducers would run out of memory. It turns out that if a reducer uses up too
much memory and brings down a node, it can also kill the services that are
making map data available to other reducers. I can't explain exactly why this
particular error happens, but I have found that the culprit is often memory
usage (normally in the reducer).
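
If you suspect memory, one cheap thing to try (the value here is just an
example; tune it for your nodes) is giving the task JVMs more heap for the
job, e.g. for a streaming run:

  -jobconf mapred.child.java.opts=-Xmx512m

and then watching whether the block errors go away.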
Ashish

On Thu, Aug 28, 2008 at 7:59 AM, Jason Venner [EMAIL PROTECTED] wrote:

 We have started to see this class of error under Hadoop 0.16.1 on a
 medium-sized HDFS cluster under moderate load.

 wangxu wrote:
  Hi,all
  I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
  and running hadoop on one namenode and 4 slaves.
  attached is my hadoop-site.xml, and I didn't change the file
  hadoop-default.xml
 
  when data in segments are large,this kind of errors occure:
 
  java.io.IOException: Could not obtain block:
 blk_-2634319951074439134_1129
 file=/user/root/crawl_debug/segments/20080825053518/content/part-2/data
at
 org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
at
 org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
at
 org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at
 org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
at
 org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
at
 org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
at
 org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
at
 org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
at
 org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
at
 org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
at
 org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
at
 org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
at
 org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
at
 org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
at
 org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
at
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 
 
  How can I correct this?
  Thanks,
  Xu
 
 
 --
 Jason Venner
 Attributor - Program the Web http://www.attributor.com/
 Attributor is hiring Hadoop Wranglers and coding wizards, contact if
 interested



Re: MR use case where each reducer/mapper receives different parameters

2008-08-13 Thread Ashish Venugopal
Also, just to clarify a couple of points:
I am using Hadoop On Demand, which means that to run a job I first have to
allocate a cluster. I am using the HOD script mechanism, where the cluster
is allocated for the running time of my HOD script. If my script could
schedule multiple MR jobs but only relinquish control when all jobs are done,
I could simply schedule one MR job per parameter setting.

Ashish


On Wed, Aug 13, 2008 at 2:08 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:

 Hi, I need to implement a specific use case that comes up often in the
 machine learning / NLP community. Often we want to run some kind of
 optimization process on a data set, but we want to run the optimization at
 several different initial parameter settings. While this is not the usual MR
 paradigm of splitting up a large task and then recombining the partial
 outputs, I would like to use Hadoop to handle the parallelization.
 The streaming documentation page (
 http://hadoop.apache.org/core/docs/current/streaming.html) mentions that
 streaming can be used to create jobs with multiple different parameters, but
 it does not give any example, so it's not clear to me how to give each mapper
 (or each reducer) a specific set of parameters. If each mapper/reducer had
 access to some kind of job index number, I could potentially write a side
 file that maps ids to params, but this seems clumsy.
 The only solution that I have now is that my mapper phase will replicate
 the data, pairing it with a set of keys that represent different parameters.
 Then each reducer will see key-value pairs; by reading the key it can get
 its parameters, and the value has the data. Any other solutions?
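
 An untested sketch of what that replicating mapper could look like in
 streaming (Perl; the parameter settings here are made up):

     #!/usr/bin/perl
     # Emit every input line once per parameter setting, keyed by the setting,
     # so a reducer sees the full data set for one setting at a time.
     use strict;
     use warnings;

     my @settings = ("alpha=0.1", "alpha=0.5", "alpha=0.9");

     while (my $line = <STDIN>) {
         chomp $line;
         for my $s (@settings) {
             print "$s\t$line\n";    # key = parameter setting, value = data line
         }
     }

 Run with -jobconf mapred.reduce.tasks set to (at least) the number of
 settings; the reducer script then just watches for the key to change.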

 Thanks!

 Ashish



Re: Difference between Hadoop Streaming and Normal mode

2008-08-12 Thread Ashish Venugopal
There is definitely functionality in normal mode that is not available in
streaming, like the ability to write counters to instrument jobs. I
personally just use streaming, so I am interested to see whether there are
further key differences...
Ashish

On Tue, Aug 12, 2008 at 3:09 PM, Gaurav Veda [EMAIL PROTECTED] wrote:

 Hi All,

 This might seem too silly, but I couldn't find a satisfactory answer
 to this yet. What are the advantages / disadvantages of using Hadoop
 Streaming over the normal mode (wherein you write your own mapper and
 reducer in Java)? From what I gather, the real advantage of Hadoop
 Streaming is that you can use any executable (in C / Perl / Python,
 etc.) as a mapper / reducer.
 A slight disadvantage is that the default is to read (write) from
 standard input (output), though you can specify your own Input and
 Output formats (and package them with the default Hadoop streaming jar
 file).
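
 For example, an invocation could look something like this (the jar path and
 the input format class here are just illustrative):

   bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
     -input in -output out \
     -mapper ./mapper.pl -reducer ./reducer.pl \
     -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat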

 My point is, why should I ever use the normal mode? Streaming seems
 just as good. Is there a performance problem, do I have only limited
 control over my job if I use the streaming mode, or is there some other issue?

 Thanks!
 Gaurav
 --
 Share what you know, learn what you don't !



using too many mappers?

2008-07-18 Thread Ashish Venugopal
Is it possible that using too many mappers causes issues in Hadoop 0.17.1? I
have an input data directory with 100 files in it. I am running a job that
takes these files as input. When I set -jobconf mapred.map.tasks=200 in
the job invocation, it seems like the mappers receive empty inputs (which
my binary does not handle cleanly). When I unset the mapred.map.tasks
parameter, the job runs fine, and many mappers still get used because the
input files are manually split. Can anyone offer an explanation? Have there
been changes in the use of this parameter between 0.16.4 and 0.17.1?
Ashish


problem when many map tasks are used (since 0.17.1 was installed)

2008-06-30 Thread Ashish Venugopal
The crash below occurs when I run many mappers (-jobconf mapred.map.tasks=200).
It does not occur if I set mapred.map.tasks=1, even when I allocate many
machines (causing there to be many mappers). But when I set the number of map
tasks to 200, the error below happens. This just started happening after the
recent upgrade to 0.17.1 (previously I was using 0.16.4). This is a streaming
job. Any help is appreciated.


Ashish

Exception closing file
/user/ashishv/iwslt/syn_baseline/translation_dev/_temporary/_task_200806272233_0001_m_000174_0/part-00174
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete
write to file /user/ashishv/iwslt/syn_baseline/translation_dev/_tem
porary/_task_200806272233_0001_m_000174_0/part-00174 by
DFSClient_task_200806272233_0001_m_000174_0
at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:332)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2655)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2576)
at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:221)


Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Venugopal
Just a small note (this does not answer your question, but concerns your
testing command): when running the command-line version below, it's important
to test with

sort -t "$TAB" -k 1,1

where TAB is something like:

TAB=$(printf '\t')

to ensure that you sort by the key field only, rather than by the whole line.
Sorting by the whole line can make your reduce code seem to work during
testing (if you are testing on the command line), but then not work correctly
via Hadoop.
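
For example, a quick local check of a streaming job could look like this
(script and file names are placeholders):

  TAB=$(printf '\t')
  head -10000 input.txt | ./mapper.pl | sort -t "$TAB" -k 1,1 | ./reducer.pl > local-out.txt

which approximates the key-grouped ordering the reducer sees under Hadoop.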



On Tue, Jun 10, 2008 at 6:56 PM, Elia Mazzawi [EMAIL PROTECTED]
wrote:

 Hello,

 we were considering using Hadoop to process some data.
 We have it set up on 8 nodes (1 master + 7 slaves).

 We filled the cluster up with files that contain tab-delimited data
 (string \tab string, etc.),
 then we ran the example grep with a regular expression to count the number
 of occurrences of each unique starting string.
 We had 3500 files containing 3,015,294 lines, totaling 5 GB.

 To benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
 It took 26 minutes.

 Then, to compare, we ran this bash command on one of the nodes, which
 produced the same output from the data:

 cat * | sed -e s/\  .*// | sort | uniq -c > /tmp/out
 (the sed regexp is a tab, not spaces)

 which took 2.5 minutes.

 Then we added 10X the data into the cluster and reran Hadoop; it took 214
 minutes, which is less than 10X the time, but still not that impressive.


 So we are seeing a 10X performance penalty for using Hadoop vs. the system
 commands.
 Is that expected?
 We were expecting Hadoop to be faster since it is distributed.
 Perhaps there is too much overhead involved here?
 Is the data too small?



HOD cluster could not be allocated message

2008-06-09 Thread Ashish Venugopal
Hi, I am using HOD on version 0.16.4, and I *occasionally* get the following
message:

CRITICAL - Cannot allocate cluster /user/
CRITICAL - Error 8 in allocating the cluster. Cannot run the script.

It seems that this message comes when the pool of machines is completely in
use, but my understanding was that the job would just be queued and would
wait until machines are available; instead, I sometimes get the message
above.

Is there a timeout that governs this process that might need to be changed?

Error 8 is described in the documentation as below, but this doesn't seem to
shed light on what happened here:

 8   Job tracker failure   Similar to the causes in DFS failure case.


typo on hadoop streaming page

2008-05-27 Thread Ashish Venugopal
Hi, I am not sure who to inform, but there is a typo on:
http://hadoop.apache.org/core/docs/r0.16.4/streaming.html

GzipCode should be GzipCodec.

I was just copying and pasting, and my streaming job kept failing... The
snippet from the page is below.
How do I generate output files with gzip format?

Instead of plain text files, you can generate gzip files as your generated
output. Pass '-jobconf mapred.output.compress=true -jobconf
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCode' as
option to your streaming job.
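
In other words, the working options should be:

  -jobconf mapred.output.compress=true
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec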


Re: one key per output part file

2008-04-03 Thread Ashish Venugopal
Thanks Yuri! I followed your pattern here, and the version where you make the
system call directly to -put onto DFS works for me. I did not set
$ENV{HADOOP_HEAPSIZE}=300;
and it seems to work fine (I didn't try setting this variable to see whether
it mattered).
I also used Perl's built-in File::Temp mechanism to avoid worrying about
manually deleting the temp file.
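
Roughly (an untested sketch; $png_data and $filename are placeholders for
whatever the reducer produced):

  use File::Temp qw(tempfile);

  # File::Temp removes the file at program exit when UNLINK is set
  my ($tmp_fh, $tmp_path) = tempfile(SUFFIX => '.png', UNLINK => 1);
  binmode($tmp_fh);
  print $tmp_fh $png_data;
  close($tmp_fh);

  # then -put it onto DFS as in Yuri's snippet
  my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
  system($hadoop, "dfs", "-put", $tmp_path, "$ENV{mapred_output_dir}/${filename}.png");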

Thanks!

Ashish


On Thu, Apr 3, 2008 at 12:07 PM, Yuri Pradkin [EMAIL PROTECTED] wrote:

 Here is how we (attempt to) do it:

 The reducer (in streaming) writes one file for each different key it receives
 as input.
 Here's some example code in Perl:
    my $envdir = $ENV{'mapred_output_dir'};
    my $fs = ($envdir =~ s/^file://);
    if ($fs) {
        # output goes onto NFS
        open(FILEOUT, ">$envdir/${filename}.png")
            or die "$0: cannot open $envdir/${filename}.png: $!\n";
    } else {
        # output specifies DFS
        open(FILEOUT, ">/tmp/${filename}.png")
            or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
    }
    ... # write FILEOUT
    if ($fs) {
        # for NFS just fix permissions
        chmod 0664, "$envdir/$filename.png";
        chmod 0775, $envdir;
    } else {
        # for HDFS, -put the file
        my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
        $ENV{HADOOP_HEAPSIZE} = 300;
        system($hadoop, "dfs", "-put", "/tmp/${filename}.png", "$envdir/${filename}.png")
            and unlink "/tmp/${filename}.png";
    }

 If the -output option to streaming specifies an NFS directory, everything
 works except that it doesn't scale.  We must use the mapred_output_dir
 environment variable because it points to the temporary directory, and you
 don't want 2 or more instances of the same task writing to the same file.

 If -output points to HDFS, however, the code above bombs while trying to
 -put a file, with an error something like "could not reserve enough memory
 for java vm heap/libs", at which point Java dies.  If anyone has any
 suggestions on how to fix that, I'd appreciate it.

 Thanks,

  -Yuri

 On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
  Hi, I am using Hadoop streaming and I am trying to create a MapReduce job
  that will generate output where a single key is found in a single output
  part file.
  Does anyone know how to ensure this condition? I want the reduce tasks (no
  matter how many are specified) to each receive key-value output from a
  single key, process the key-value pairs for this key, write an output
  part-XXX file, and only then process the next key.

  Here is the task that I am trying to accomplish:

  Input: Corpus T (lines of text), Corpus V (each line has 1 word)
  Output: Each part-XXX should contain the lines of T that contain the word
  from line XXX in V.

  Any help/ideas are appreciated.

  Ashish





Re: one key per output part file

2008-04-02 Thread Ashish Venugopal
On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma [EMAIL PROTECTED]
wrote:

 Curious - why do we need a file per XXX?

 If the data needs to be exported (either to a SQL db or an external file
 system), then why not do so directly from the reducer (instead of trying to
 create these intermediate small files in HDFS)? Data can be written to tmp
 tables/files and can be overwritten in case the reducer re-runs (and then
 committed to the final location once the job is complete).


The second case (data needs to be exported) is the reason that I have. Each
of these small files is used by an external process. This seems like a good
solution; the only question then is where these files can be written safely.
A local directory? /tmp?

Ashish






 -Original Message-
 From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
 Sent: Tue 4/1/2008 6:42 PM
 To: core-user@hadoop.apache.org
 Subject: Re: one key per output part file

 This seems like a reasonable solution, but I am using Hadoop streaming and
 my reducer is a Perl script. Is it possible to handle side-effect files in
 streaming? I haven't found
 anything that indicates that you can...

 Ashish

 On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning [EMAIL PROTECTED] wrote:

 
 
  Try opening the desired output file in the reduce method.  Make sure
 that
  the output files are relative to the correct task specific directory
 (look
  for side-effect files on the wiki).
 
 
 
  On 4/1/08 5:57 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:
 
   Hi, I am using Hadoop streaming and I am trying to create a MapReduce job
   that will generate output where a single key is found in a single output
   part file.
   Does anyone know how to ensure this condition? I want the reduce tasks (no
   matter how many are specified) to each receive key-value output from a
   single key, process the key-value pairs for this key, write an output
   part-XXX file, and only then process the next key.

   Here is the task that I am trying to accomplish:

   Input: Corpus T (lines of text), Corpus V (each line has 1 word)
   Output: Each part-XXX should contain the lines of T that contain the word
   from line XXX in V.

   Any help/ideas are appreciated.

   Ashish
 
 




Re: one key per output part file

2008-04-01 Thread Ashish Venugopal
This seems like a reasonable solution, but I am using Hadoop streaming and
my reducer is a Perl script. Is it possible to handle side-effect files in
streaming? I haven't found
anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning [EMAIL PROTECTED] wrote:



 Try opening the desired output file in the reduce method.  Make sure that
 the output files are relative to the correct task specific directory (look
 for side-effect files on the wiki).



 On 4/1/08 5:57 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:

  Hi, I am using Hadoop streaming and I am trying to create a MapReduce job
  that will generate output where a single key is found in a single output
  part file.
  Does anyone know how to ensure this condition? I want the reduce tasks (no
  matter how many are specified) to each receive key-value output from a
  single key, process the key-value pairs for this key, write an output
  part-XXX file, and only then process the next key.

  Here is the task that I am trying to accomplish:

  Input: Corpus T (lines of text), Corpus V (each line has 1 word)
  Output: Each part-XXX should contain the lines of T that contain the word
  from line XXX in V.

  Any help/ideas are appreciated.

  Ashish