Re: Hadoop Profiling!
Are you interested in simply profiling your own code (in which case you can clearly use whatever Java profiler you want), or in profiling the structure of the MapReduce job itself, i.e. how much time is being spent in the map vs. the sort vs. the shuffle vs. the reduce? I am not aware of a good solution to the second problem; can anyone comment?

Ashish

On Wed, Oct 8, 2008 at 12:06 PM, Stefan Groschupf [EMAIL PROTECTED] wrote:
Just run your MapReduce job locally and connect your profiler. I use YourKit; works great! You can profile your MapReduce job by running it in local mode, just as you would any other Java app. However, we have also profiled on a grid. You just need to install the YourKit agent into the JVM of the node you want to profile and then connect to that node while the job runs. You do need to time things well, though, since the task JVM is shut down as soon as your job is done.
Stefan
~~~ 101tec Inc., Menlo Park, California
web: http://www.101tec.com
blog: http://www.find23.net

On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote:
Hi! I've developed a Map/Reduce algorithm to analyze some logs from a web application. We are ready to start the QA test phase, so now I would like to know how efficient my application is from a performance point of view. Is there any procedure I could use to do some profiling? Basically I need basic data, like execution time or code bottlenecks. Thanks in advance.
-- Gerardo Velez
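A minimal sketch of the agent approach Stefan describes, for anyone who would rather wire it into the job configuration than into the node setup: mapred.child.java.opts is the standard property for passing JVM options to task JVMs, while the agent path and class name below are purely hypothetical and would need to match whatever profiler is actually installed on the nodes.

    import org.apache.hadoop.mapred.JobConf;

    public class ProfiledJobSetup {
        // Pass a profiler agent to every task (child) JVM. The task JVM exits
        // as soon as its task finishes, so the profiler must attach while the
        // task is still running.
        public static JobConf withProfilerAgent(JobConf conf) {
            conf.set("mapred.child.java.opts",
                     "-Xmx512m -agentpath:/opt/yourkit/libyjpagent.so");
            return conf;
        }
    }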
Re: Hadoop Profiling!
Great, thanks for this info. Is there any chance that this information can also be exposed for streaming jobs? (All of the jobs that we run in our lab are run via streaming...) Thanks!

Ashish

On Wed, Oct 8, 2008 at 12:30 PM, George Porter [EMAIL PROTECTED] wrote:
Hi Ashish, I believe that Ari committed two instrumentation classes, TaskTrackerInstrumentation and JobTrackerInstrumentation (both in src/mapred/org/apache/hadoop/mapred), that can give you information on when components of your M/R jobs start and stop. I'm in the process of writing some additional instrumentation APIs that collect timing information about the RPC and HDFS layers, and will hopefully be able to submit a patch in a few weeks.
Thanks, George

Ashish Venugopal wrote:
Are you interested in simply profiling your own code (in which case you can clearly use whatever Java profiler you want), or in profiling the structure of the MapReduce job itself, i.e. how much time is being spent in the map vs. the sort vs. the shuffle vs. the reduce? I am not aware of a good solution to the second problem; can anyone comment? Ashish

-- George Porter, Sun Labs/CTO, Sun Microsystems - San Diego, Calif. [EMAIL PROTECTED] 1.858.526.9328
Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
It's slightly counterintuitive, but I used to get errors like this when my reducers would run out of memory. It turns out that if a reducer uses too much memory and brings down a node, it can also kill the services on that node that are making map data available to other reducers. I can't explain exactly why this exact error happens, but I have found that the culprit is often memory usage (normally in the reducer).

Ashish

On Thu, Aug 28, 2008 at 7:59 AM, Jason Venner [EMAIL PROTECTED] wrote:
We have started to see this class of error under Hadoop 0.16.1 on a medium-sized HDFS cluster under moderate load.

wangxu wrote:
Hi all, I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar, and running Hadoop with one namenode and 4 slaves. Attached is my hadoop-site.xml; I didn't change the file hadoop-default.xml. When the data in the segments is large, this kind of error occurs:

java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
    at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
    at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
    at org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
    at org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
    at org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
    at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
    at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

How can I correct this? Thanks. Xu

-- Jason Venner, Attributor - Program the Web, http://www.attributor.com/ Attributor is hiring Hadoop Wranglers and coding wizards; contact if interested.
Re: MR use case where each reducer/mapper receives different parameters
Also, just to clarify a couple of points: I am using Hadoop On Demand, which means that to run a job I first have to allocate a cluster, and I am using the HOD script mechanism, where the cluster is allocated for the running time of my HOD script. If my script could schedule multiple MR jobs but only relinquish control when all jobs are done, I could simply schedule one MR job per parameter setting.

Ashish

On Wed, Aug 13, 2008 at 2:08 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:
Hi, I need to implement a specific use case that comes up often in the machine learning / NLP community. Often we want to run some kind of optimization process on a data set, but we want to run the optimization from several different initial parameters. While this is not the usual MR paradigm of splitting up a large task and then recombining the partial outputs, I would like to use Hadoop to handle the parallelization. The streaming documentation page (http://hadoop.apache.org/core/docs/current/streaming.html) mentions that streaming can be used to create jobs with multiple different parameters, but it does not give any example, so it's not clear to me how to give each mapper (or each reducer) a specific set of parameters. If each mapper/reducer had access to some kind of job index number, I could potentially write a side file which maps ids to params, but this seems clumsy. The only solution that I have now is that my mapper phase will replicate the data, pairing it with a set of keys that represent different parameters. Then each reducer will see a key-value pair; by reading the key it can get its parameters, and the value has the data. Any other solutions? Thanks! Ashish
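Not from the thread itself, but a minimal sketch (in the Java API rather than streaming, with hypothetical parameter values) of the replicate-and-key approach described in the last paragraph: the mapper emits one copy of each record per parameter setting, keyed by that setting, so each reduce group carries its own parameters along with the data.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReplicateByParamMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        // Hypothetical parameter settings; in practice these could be read
        // from the JobConf or from a side file shipped with the job.
        private static final String[] PARAMS = {"alpha=0.1", "alpha=0.5", "alpha=0.9"};

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // Emit one copy of the record per parameter setting; each reducer
            // recovers its parameters by reading the key.
            for (String param : PARAMS) {
                out.collect(new Text(param), line);
            }
        }
    }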
Re: Difference between Hadoop Streaming and Normal mode
There is definitely functionality in normal mode that is not available in streaming, like the ability to write counters to instrument jobs. I personally just use streaming, so I am interested to see if there are further key differences...

Ashish

On Tue, Aug 12, 2008 at 3:09 PM, Gaurav Veda [EMAIL PROTECTED] wrote:
Hi All, This might seem too silly, but I couldn't find a satisfactory answer to this yet. What are the advantages / disadvantages of using Hadoop Streaming over the normal mode (wherein you write your own mapper and reducer in Java)? From what I gather, the real advantage of Hadoop Streaming is that you can use any executable (in C / Perl / Python etc.) as a mapper / reducer. A slight disadvantage is that the default is to read (write) from standard input (output), though one can specify one's own input and output formats (and package them with the default Hadoop streaming jar file). My point is, why should I ever use the normal mode? Streaming seems just as good. Is there a performance problem, do I have only limited control over my job if I use the streaming mode, or is there some other issue?
Thanks! Gaurav
-- Share what you know, learn what you don't!
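As one concrete illustration of the counter support mentioned above, here is a minimal sketch using the old org.apache.hadoop.mapred Java API; the counter names and the condition being counted are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        // Counter values are aggregated across tasks and shown in the
        // job tracker UI and job history.
        enum RecordStats { EMPTY_LINES, GOOD_LINES }

        private static final LongWritable ONE = new LongWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            if (line.toString().trim().length() == 0) {
                reporter.incrCounter(RecordStats.EMPTY_LINES, 1);
                return;
            }
            reporter.incrCounter(RecordStats.GOOD_LINES, 1);
            out.collect(line, ONE);
        }
    }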
using too many mappers?
Is it possible that using too many mappers causes issues in Hadoop 0.17.1? I have an input data directory with 100 files in it, and I am running a job that takes these files as input. When I set -jobconf mapred.map.tasks=200 in the job invocation, it seems like some mappers receive empty inputs (which my binary does not handle cleanly). When I unset the mapred.map.tasks parameter, the job runs fine, and many mappers still get used because the input files are manually split. Can anyone offer an explanation, or have there been changes in the use of this parameter between 0.16.4 and 0.17.1? Ashish
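For reference, a small sketch (Java API, placeholder class name) of the same setting; as far as I understand, mapred.map.tasks is only a hint to the InputFormat, which ultimately decides how many splits, and therefore map inputs, are produced.

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
        public static void main(String[] args) {
            JobConf conf = new JobConf(MapCountHint.class);
            // Equivalent to -jobconf mapred.map.tasks=200 on a streaming job.
            // This is only a hint: the InputFormat's getSplits() decides the
            // real number of splits, and asking for far more splits than the
            // data supports can yield very small map inputs.
            conf.setNumMapTasks(200);
            System.out.println("requested map tasks: " + conf.getNumMapTasks());
        }
    }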
problem when many map tasks are used (since 0.17.1 was installed)
The crash below occurs when I run many (-jobconf mapred.map.tasks=200) mappers. It does not occur if I set mapred.map.tasks=1, even when I allocate many machines (causing there to be many mappers). But when I set the number of map tasks to 200, the error below happens. This started happening after the recent upgrade to 0.17.1 (previously I was using 0.16.4). This is a streaming job. Any help is appreciated. Ashish

Exception closing file /user/ashishv/iwslt/syn_baseline/translation_dev/_temporary/_task_200806272233_0001_m_000174_0/part-00174
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file /user/ashishv/iwslt/syn_baseline/translation_dev/_temporary/_task_200806272233_0001_m_000174_0/part-00174 by DFSClient_task_200806272233_0001_m_000174_0
    at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:332)
    at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2655)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2576)
    at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:221)
Re: hadoop benchmarked, too slow to use
Just a small note (it does not answer your question, but deals with your testing command): when running the system-command version below, it's important to test with sort -k 1 -t $TAB, where TAB is something like TAB=`echo \t`, to ensure that you sort by key rather than by the whole line. Sorting by the whole line can make your reduce code seem to work during testing on the command line, but then not work correctly via Hadoop.

On Tue, Jun 10, 2008 at 6:56 PM, Elia Mazzawi [EMAIL PROTECTED] wrote:
Hello, we were considering using Hadoop to process some data. We have it set up on 8 nodes (1 master + 7 slaves). We filled the cluster with files that contain tab-delimited data (string \tab string, etc.), then we ran the example grep with a regular expression to count the number of each unique starting string. We had 3500 files containing 3,015,294 lines, totaling 5 GB. To benchmark it, we ran:

bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'

It took 26 minutes. Then, to compare, we ran this bash command on one of the nodes, which produced the same output from the data:

cat * | sed -e 's/ .*//' | sort | uniq -c > /tmp/out

(the sed regexp is a tab, not spaces), which took 2.5 minutes. Then we added 10X the data into the cluster and reran Hadoop; it took 214 minutes, which is less than 10X the time, but still not that impressive. So we are seeing a 10X performance penalty for using Hadoop vs. the system commands. Is that expected? We were expecting Hadoop to be faster since it is distributed. Perhaps there is too much overhead involved here? Is the data too small?
HOD cluster could not be allocated message
Hi, I am using HOD on version 0.16.4, and I *occasionally* get the following message:

CRITICAL - Cannot allocate cluster /user/
CRITICAL - Error 8 in allocating the cluster. Cannot run the script.

It seems that this message comes when the pool of machines is completely in use, but my understanding was that the job would just be queued and would wait until machines are available; instead, I sometimes get the message above. Is there a timeout that governs this process that might need to be changed? Error 8 is described in the documentation as below, but this doesn't seem to shed light on what happened here:

8 - Job tracker failure. Similar to the causes in the DFS failure case.
typo on hadoop streaming page
Hi, I am not sure who to inform, but there is a typo on http://hadoop.apache.org/core/docs/r0.16.4/streaming.html: GzipCode should be GzipCodec. I was just copying and pasting, and my streaming job kept failing... The snippet from the page is below.

How do I generate output files with gzip format? Instead of plain text files, you can generate gzip files as your generated output. Pass '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCode' as option to your streaming job.
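For the Java API, a minimal sketch of the equivalent (corrected) settings; the property names mirror the -jobconf options quoted above, and the helper class is hypothetical.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class GzipOutputSetup {
        public static JobConf withGzipOutput(JobConf conf) {
            // Same effect as the corrected streaming options:
            //   -jobconf mapred.output.compress=true
            //   -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
            conf.setBoolean("mapred.output.compress", true);
            conf.set("mapred.output.compression.codec", GzipCodec.class.getName());
            return conf;
        }
    }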
Re: one key per output part file
Thanks Yuri! I followed your pattern here, and the version where you make the system call directly to -put onto DFS works for me. I did not set $ENV{HADOOP_HEAPSIZE}=300; and it seems to work fine (I didn't try setting this variable to see if it failed). I also used Perl's built-in File::Temp mechanism to avoid worrying about manually deleting the temp file. Thanks!

Ashish

On Thu, Apr 3, 2008 at 12:07 PM, Yuri Pradkin [EMAIL PROTECTED] wrote:
Here is how we (attempt to) do it: the reducer (in streaming) writes one file for each different key it receives as input. Here's some example code in Perl:

my $envdir = $ENV{'mapred_output_dir'};
my $fs = ($envdir =~ s/^file://);
if ($fs) {
    # output goes onto NFS
    open(FILEOUT, ">$envdir/${filename}.png")
        or die "$0: cannot open $envdir/${filename}.png: $!\n";
} else {
    # output specifies DFS
    open(FILEOUT, ">/tmp/${filename}.png")
        or die "Cannot open /tmp/${filename}.png: $!\n";
    # or pipe to dfs -put
}
...   # write FILEOUT
if ($fs) {
    # for NFS just fix permissions
    chmod 0664, "$envdir/$filename.png";
    chmod 0775, $envdir;
} else {
    # for HDFS, -put the file
    my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
    $ENV{HADOOP_HEAPSIZE} = 300;
    system($hadoop, "dfs", "-put", "/tmp/${filename}.png", "$envdir/${filename}.png")
        and unlink "/tmp/${filename}.png";
}

If the -output option to streaming specifies an NFS directory, everything works, except that it doesn't scale. We must use the mapred_output_dir environment variable because it points to the temporary directory, and you don't want 2 or more instances of the same task writing to the same file. If -output points to HDFS, however, the code above bombs while trying to -put a file, with an error something like "could not reserve enough memory for java vm heap/libs", at which point Java dies. If anyone has any suggestions on how to fix that, I'd appreciate it.
Thanks, -Yuri

On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that will generate output where a single key is found in a single output part file. Does anyone know how to ensure this condition? I want each reduce task (no matter how many are specified) to receive the key-value output for only a single key at a time, process the key-value pairs for this key, write an output part-XXX file, and only then process the next key. Here is the task that I am trying to accomplish: Input: Corpus T (lines of text), Corpus V (each line has 1 word). Output: Each part-XXX should contain the lines of T that contain the word from line XXX in V. Any help/ideas are appreciated. Ashish
Re: one key per output part file
On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
Curious - why do we need a file per XXX? If the data needs to be exported (either to a SQL db or an external file system), then why not do so directly from the reducer (instead of trying to create these intermediate small files in HDFS)? Data can be written to tmp tables/files and can be overwritten in case the reducer re-runs (and then committed to the final location once the job is complete).

The second case (data needs to be exported) is the reason that I have. Each of these small files is used in an external process. This seems like a good solution - the only question then is where these files can be written to safely. A local directory? /tmp?

Ashish

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
Sent: Tue 4/1/2008 6:42 PM
To: core-user@hadoop.apache.org
Subject: Re: one key per output part file

This seems like a reasonable solution - but I am using Hadoop streaming and my reducer is a perl script. Is it possible to handle side-effect files in streaming? I haven't found anything that indicates that you can... Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning [EMAIL PROTECTED] wrote:
Try opening the desired output file in the reduce method. Make sure that the output files are relative to the correct task-specific directory (look for side-effect files on the wiki).

On 4/1/08 5:57 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:
Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that will generate output where a single key is found in a single output part file. Does anyone know how to ensure this condition? I want each reduce task (no matter how many are specified) to receive the key-value output for only a single key at a time, process the key-value pairs for this key, write an output part-XXX file, and only then process the next key. Here is the task that I am trying to accomplish: Input: Corpus T (lines of text), Corpus V (each line has 1 word). Output: Each part-XXX should contain the lines of T that contain the word from line XXX in V. Any help/ideas are appreciated. Ashish
Re: one key per output part file
This seems like a reasonable solution - but I am using Hadoop streaming and my reducer is a perl script. Is it possible to handle side-effect files in streaming? I haven't found anything that indicates that you can... Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning [EMAIL PROTECTED] wrote:
Try opening the desired output file in the reduce method. Make sure that the output files are relative to the correct task-specific directory (look for side-effect files on the wiki).

On 4/1/08 5:57 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:
Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that will generate output where a single key is found in a single output part file. Does anyone know how to ensure this condition? I want each reduce task (no matter how many are specified) to receive the key-value output for only a single key at a time, process the key-value pairs for this key, write an output part-XXX file, and only then process the next key. Here is the task that I am trying to accomplish: Input: Corpus T (lines of text), Corpus V (each line has 1 word). Output: Each part-XXX should contain the lines of T that contain the word from line XXX in V. Any help/ideas are appreciated. Ashish
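Not suggested in the thread itself, but in the Java API one possible sketch of the one-key-per-part-file requirement is a custom Partitioner that routes each word's line index in V to its own reducer; the class name is hypothetical, and it assumes the mapper emits that line index as the key and that the number of reduce tasks is at least the number of lines in V.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Assumes the mapper emits, for each matching line of T, a key that is the
    // line number XXX of the matching word in V (e.g. "17"), with the line of T
    // as the value. With mapred.reduce.tasks >= number of lines in V, each
    // part-XXXXX file then holds the lines for exactly one word.
    public class OneKeyPerPartition implements Partitioner<Text, Text> {

        public void configure(JobConf conf) {
            // no per-job configuration needed for this sketch
        }

        public int getPartition(Text key, Text value, int numPartitions) {
            // Route the word's line index in V to the reducer of the same index.
            // The modulo only matters if fewer reducers are configured, in which
            // case several keys would share a part file.
            return Integer.parseInt(key.toString()) % numPartitions;
        }
    }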