Re: Re: Re: measure the time taken to complete map and reduce phase

2011-07-07 Thread hailong.yang1115

I think TOTAL_MAPS has the same meaning as FINISHED_MAPS, which
represents the total number of map tasks successfully executed. There is
another metric, Launched_map_tasks, which is the sum of FINISHED_MAPS,
FAILED_MAPS and KILLED_MAPS.

And it is the same for Reduce tasks.
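
If you want to pull these counters straight out of a job history file, a quick
grep is enough. A rough sketch (the key="value" layout is the 0.20-era history
file format, and the path is just the example file name quoted further down in
this thread; adjust both to your installation):

  HIST=logs/history/localhost_1309975809398_job_201107062010_0759_sangroya_word+count

  grep -o 'TOTAL_MAPS="[0-9]*"'    "$HIST" | tail -1
  grep -o 'FINISHED_MAPS="[0-9]*"' "$HIST" | tail -1
  grep -o 'FAILED_MAPS="[0-9]*"'   "$HIST" | tail -1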


Cheers!

Hailong

2011-07-08 



***
* Hailong Yang, PhD. Candidate 
* Sino-German Joint Software Institute, 
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1...@gmail.com
* Address: G413, New Main Building in Beihang University, 
*  No.37 XueYuan Road,HaiDian District, 
*  Beijing,P.R.China,100191
***



From: sangroya
Sent: 2011-07-07 23:50:45
To: hadoop-user
Cc:
Subject: Re: Re: measure the time taken to complete map and reduce phase
 
Hi,
Thanks for the response!
I have the following queries regarding the Job History file.
I want to know what TOTAL_MAPS in the job history represents.
Also, whether FINISHED_MAPS represents TOTAL_MAPS or (TOTAL_MAPS -
FAILED_MAPS).
Does FINISHED_MAPS represent successfully executed maps?
I have the same question for REDUCE tasks.
Thanks,
Amit
On Thu, Jul 7, 2011 at 10:58 AM, Hailong [via Lucene]
 wrote:
> Hi sangroya,
>
> I think you may be interested in reading the following piece of code from
> JobHistory.java in Hadoop.
>
> /**
>  * Generates the job history filename for a new job
>  */
> private static String getNewJobHistoryFileName(JobConf jobConf, JobID
> id) {
>   return JOBTRACKER_UNIQUE_STRING
>  + id.toString() + "_" + getUserName(jobConf) + "_"
>  + trimJobName(getJobName(jobConf));
> }
>
> /**
>  * Trims the job-name if required
>  */
> private static String trimJobName(String jobName) {
>   if (jobName.length() > JOB_NAME_TRIM_LENGTH) {
> jobName = jobName.substring(0, JOB_NAME_TRIM_LENGTH);
>   }
>   return jobName;
> }
>
> Roughly speaking, the history file name is composed in the following way:
>
> hostname of JT + "_" + start time of JT + "_" + job id + "_" + user name +
> "_" + trimed job name
>
> Cheers!
>
> Hailong
>
> 2011-07-07
>
>
>
> ***
> * Hailong Yang, PhD. Candidate
> * Sino-German Joint Software Institute,
> * School of Computer Science&Engineering, Beihang University
> * Phone: (86-010)82315908
> * Email: [hidden email]
> * Address: G413, New Main Building in Beihang University,
> *  No.37 XueYuan Road,HaiDian District,
> *  Beijing,P.R.China,100191
> ***
>
>
>
> From: sangroya
> Sent: 2011-07-07 15:49:58
> To: hadoop-user
> Cc:
> Subject: Re: measure the time taken to complete map and reduce phase
>
> Hi,
> Thanks!
> I am able to parse the Job History Logs(JHL). But, I need to know how
> hadoop assigns a name to a file in the Job History Logs (JHL).
> I can see that files are named on my local single node cluster as this:
> localhost_1309975809398_job_201107062010_0759_sangroya_word+count.
> But, I am just wondering, what is the exact pattern to name every file
> like this.
> Best Regards,
> Amit
> On Tue, Jul 5, 2011 at 6:53 AM, Hailong [via Lucene]
> <[hidden email]> wrote:
>> Hi sangroya,
>>
>> You can look at the job administration portal at port 50030 on your
>> JobTracker, such as 'http://localhost:50030'. At the bottom of the
>> web page there is an item named 'Job Tracker History'; click into it and
>> find your job with the job id. There you will find the information you want.
>>
>>
>> Cheers!
>>
>> Hailong
>>
>> 2011-07-05
>>
>>
>>
>> ***
>> * Hailong Yang, PhD. Candidate
>> * Sino-German Joint Software Institute,
>> * School of Computer Science&Engineering, Beihang University
>> * Phone: (86-010)82315908
>> * Email: [hidden email]
>> * Address: G413, New Main Building in Beihang University,
>> *  No.37 XueYuan Road,HaiDian District,
>> *  Beijing,P.R.China,100191
>> ***
>>
>>
>>
>> From: sangroya
>> Sent: 2011-07-05 10:56:38
>> To: hadoop-user
>> Cc:
>> Subject: measure the time taken to complete map and reduce phase
>>
>> Hi,
>> I am trying to monitor the time to complete a map phase and reduce
>> phase in hadoop. Is there any way to measure the time taken to
>> complete map and reduce phase in a cluster.
>> Thanks,
>> Amit
>> --
>> View this message in context:
>>
>> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3136991.html
>> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>>
>>

RE: Difference between DFS Used and Non-DFS Used

2011-07-07 Thread Sagar Shukla
Hi Harsh,
 Thanks for your reply.

But why does it require non-DFS storage? And why is that space accounted
differently from regular DFS storage?

Ideally, it should have been part of the same storage.

Thanks,
Sagar

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, July 07, 2011 6:04 PM
To: common-user@hadoop.apache.org
Subject: Re: Difference between DFS Used and Non-DFS Used

DFS used is a count of all the space used by the dfs.data.dirs. The
non-dfs used space is whatever space is occupied beyond that (which
the DN does not account for).

On Thu, Jul 7, 2011 at 3:29 PM, Sagar Shukla
 wrote:
> Hi,
>       What is the difference between DFS Used and Non-DFS used ?
>
> Thanks,
> Sagar
>
>
>



--
Harsh J

DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.



Re: HTTP Error

2011-07-07 Thread Adarsh Sharma

Thanks, but I still don't understand the issue.

My name node repeatedly shows these logs:

2011-07-08 09:36:31,365 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hadoop,hadoop  ip=/Master-IP  cmd=listStatus  src=/home/hadoop/system  dst=null  perm=null
2011-07-08 09:36:31,367 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 2 on 9000, call delete(/home/hadoop/system, true) from 
Master-IP:53593: error: 
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete 
/home/hadoop/system. Name node is in safe mode.
The ratio of reported blocks 0.8293 has not reached the threshold 
0.9990. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete 
/home/hadoop/system. Name node is in safe mode.
The ratio of reported blocks 0.8293 has not reached the threshold 
0.9990. Safe mode will be turned off automatically.
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1700)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1680)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:517)

   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)


And one of my data nodes shows the logs below:

2011-07-08 09:49:56,967 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: 
DNA_REGISTER
2011-07-08 09:49:59,962 WARN 
org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is shutting 
down: org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node 
192.168.0.209:50010 is attempting to report storage ID 
DS-218695497-SLave_IP-50010-1303978807280. Node SLave_IP:50010 is 
expected to serve this storage.
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDatanode(FSNamesystem.java:3920)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processReport(FSNamesystem.java:2891)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.blockReport(NameNode.java:715)

   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

   at org.apache.hadoop.ipc.Client.call(Client.java:740)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
   at $Proxy4.blockReport(Unknown Source)
   at 
org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:756)
   at 
org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1186)

   at java.lang.Thread.run(Thread.java:619)

2011-07-08 09:50:00,072 INFO org.apache.hadoop.ipc.Server: Stopping 
server on 50020
2011-07-08 09:50:00,072 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 1 on 50020: exiting
2011-07-08 09:50:00,074 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 2 on 50020: exiting
2011-07-08 09:50:00,074 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 0 on 50020: exiting
2011-07-08 09:50:00,076 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
Server listener on 50020
2011-07-08 09:50:00,077 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
Server Responder
2011-07-08 09:50:00,077 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup 
to exit, active threads is 1
2011-07-08 09:50:00,078 WARN 
org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(SLave_IP:50010, 
storageID=DS-218695497-192.168.0.209-50010-1303978807280, 
infoPort=50075, ipcPort=50020):DataXceiveServer: 
java.nio.channels.AsynchronousCloseException
   at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
   at 
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
   at 
s

Re: Clarification: CDH3 - installation & JDK dependency

2011-07-07 Thread Kumar Kandasami
Note - the CDH3 installation worked once OpenJDK was replaced with the Sun JDK.
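
For anyone hitting the same yum dependency error, a rough outline of the swap
(the JDK rpm file name below is only an example; use whichever 1.6 rpm you
actually downloaded, the assumption being that the Sun JDK rpm satisfies the
"jdk >= 1.6" requirement that the OpenJDK package does not):

  # optionally remove OpenJDK if nothing else needs it
  sudo yum remove java-1.6.0-openjdk

  # install the Sun JDK rpm downloaded from the Java SE site
  sudo rpm -ivh jdk-6u26-linux-amd64.rpm

  # the CDH3 package's jdk >= 1.6 dependency should now resolve
  sudo yum install hadoop-0.20-namenode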


Kumar_/|\_
www.saisk.com
ku...@saisk.com
"making a profound difference with knowledge and creativity..."


On Thu, Jul 7, 2011 at 8:58 PM, Kumar Kandasami <
kumaravel.kandas...@gmail.com> wrote:

> Hi :
> I want to verify whether CDH3 expects only the Sun JDK, and not OpenJDK.
> This will clarify whether I need to install the Sun JDK or something else. Thank you.
>
> OS:
>
> Red Hat Enterprise Linux Server release 5.5 (Tikanga)
>
> Java Version:
>
> [root@10-165-50-11 yum.repos.d]# java -version
> java version "1.6.0_20"
> OpenJDK Runtime Environment (IcedTea6 1.9.8) (rhel-1.22.1.9.8.el5_6-x86_64)
> OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
>
> Error Message:
>
> sudo yum install hadoop-0.20-namenode
>
> Error: Missing Dependency: jdk >= 1.6 is needed by package
> hadoop-0.20-0.20.2+923.21-1.noarch (cloudera-cdh3)
>  You could try using --skip-broken to work around the problem
>  You could try running: package-cleanup --problems
> package-cleanup --dupes
> rpm -Va --nofiles --nodigest
>
>
> Kumar_/|\_
> www.saisk.com
> ku...@saisk.com
> "making a profound difference with knowledge and creativity..."
>


Can I safely set dfs.blockreport.intervalMsec to a very large value (1 year or more)?

2011-07-07 Thread moon soo Lee
I have many blocks - around 50~90M on each datanode.

They often do not respond for 1~3 minutes, and I think this is because of the
full scan done for the block report.

So if I set dfs.blockreport.intervalMsec to a very large value (1 year or
more?), I expect the problem to clear up.

But if I really do that, what happens? Any side effects?
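
For reference, the knob goes into hdfs-site.xml on the datanodes; a sketch
(the 604800000 value, one week in milliseconds, is only an illustration, not a
recommendation - whether a much larger value is safe is exactly the question here):

  <!-- hdfs-site.xml on the datanodes -->
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <!-- value is in milliseconds; 604800000 = 7 days, shown only as an example -->
    <value>604800000</value>
  </property>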


Clarification: CDH3 - installation & JDK dependency

2011-07-07 Thread Kumar Kandasami
Hi:
   I want to verify whether CDH3 expects only the Sun JDK, and not OpenJDK.
This will clarify whether I need to install the Sun JDK or something else. Thank you.

OS:

Red Hat Enterprise Linux Server release 5.5 (Tikanga)

Java Version:

[root@10-165-50-11 yum.repos.d]# java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.8) (rhel-1.22.1.9.8.el5_6-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

Error Message:

sudo yum install hadoop-0.20-namenode

Error: Missing Dependency: jdk >= 1.6 is needed by package
hadoop-0.20-0.20.2+923.21-1.noarch (cloudera-cdh3)
 You could try using --skip-broken to work around the problem
 You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest


Kumar_/|\_
www.saisk.com
ku...@saisk.com
"making a profound difference with knowledge and creativity..."


Re: Hadoop Eclipse plugin doesn't show a dialog for New Hadoop Locations ... on MacOS

2011-07-07 Thread Teruhiko Kurosaka
I've found an Exception in the Eclipse log.

Eclipse complains that it can't find org.apache.hadoop.conf.Configuration.
But I can see that class in lib/hadoop-common.jar inside the plugin jar,
mapred/contrib/eclipse-plugin/hadoop-0.21.0-eclipse-plugin.jar.

A closer look shows that META-INF/MANIFEST.MF has a wrong entry:
Bundle-ClassPath: classes/,lib/hadoop-core.jar

Notice that lib/hadoop-core.jar is mentioned instead of
lib/hadoop-common.jar.
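
One possible workaround (a sketch, untested here): either fix the
Bundle-ClassPath line in META-INF/MANIFEST.MF and repack the plugin, or simply
add a copy of the bundled jar under the name the manifest expects, so the
entry resolves without rebuilding. The second option looks roughly like this:

  mkdir /tmp/plugin-fix && cd /tmp/plugin-fix

  # pull the bundled jar out of the installed plugin
  jar xf /Applications/eclipse/plugins/hadoop-0.21.0-eclipse-plugin.jar lib/hadoop-common.jar

  # give it the name the manifest actually references...
  cp lib/hadoop-common.jar lib/hadoop-core.jar

  # ...and add that entry back into the plugin jar
  jar uf /Applications/eclipse/plugins/hadoop-0.21.0-eclipse-plugin.jar lib/hadoop-core.jar

Restarting Eclipse with the -clean option afterwards forces the bundle cache
to be rebuilt so the change is picked up.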




Unhandled event loop exception

java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at org.apache.hadoop.eclipse.server.HadoopServer.<init>(HadoopServer.java:223)
    at org.apache.hadoop.eclipse.servers.HadoopLocationWizard.<init>(HadoopLocationWizard.java:88)
    at org.apache.hadoop.eclipse.actions.NewLocationAction$1.<init>(NewLocationAction.java:41)
    at org.apache.hadoop.eclipse.actions.NewLocationAction.run(NewLocationAction.java:40)
    at org.eclipse.jface.action.Action.runWithEvent(Action.java:498)
    at org.eclipse.jface.action.ActionContributionItem.handleWidgetSelection(ActionContributionItem.java:584)
    at org.eclipse.jface.action.ActionContributionItem.access$2(ActionContributionItem.java:501)
    at org.eclipse.jface.action.ActionContributionItem$6.handleEvent(ActionContributionItem.java:452)
    at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
    at org.eclipse.swt.widgets.Display.sendEvent(Display.java:3543)
    at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1250)
    at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1273)
    at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1258)
    at org.eclipse.swt.widgets.Widget.notifyListeners(Widget.java:1079)
    at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:3441)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3100)
    at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:2405)
    at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2369)
    at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2221)
    at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:500)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
    at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:493)
    at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
    at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:113)
    at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:194)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:368)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:179)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:559)
    at org.eclipse.equinox.launcher.Main.basicRun(Main.java:514)
    at org.eclipse.equinox.launcher.Main.run(Main.java:1311)





On 7/8/11 7:24 AM, "Teruhiko Kurosaka"  wrote:

>Thanks, but selection of JRE 1.6 didn't help.
>
>
>On 7/7/11 8:54 PM, "Pandu Pradhana"  wrote:
>
>>Hi, 
>>
>>Maybe it's related to the Java version you are using. Try to use Java 1.6.
>>
>>Regards,
>>--Pandu
>>
>>On Jul 7, 2011, at 4:03 PM, Teruhiko Kurosaka wrote:
>>
>>> Hi,
>>> I'm new to Hadoop.  I'm trying to set up Eclipse for Hadoop debugging.
>>> I have:
>>> Eclipse 3.5.2, configured to run apps with JRE 1.5.
>>> MacOS 10.6.8
>>> Hadoop 0.21.0
>>> 
>>> I copied mapred/contrib/eclipse-plugin/hadoop-0.21.0-eclipse-plugin.jar
>>> to /Applications/eclipse/plugins
>>> and restarted Eclipse.
>>> I switched to Map/Reduce perspective and I see Map/Reduce Locations tab
>>> next to Problems, Tasks and Javadoc tabs.
>>> I switched to the Map/Reduce Locations tab and right clicked within
>>> the pane and choose "New Hadoop location..." but nothing happens.
>>> A dialog window is supposed to pop up but nothing.
>>> 
>>> Is there any known issues? How do I trace this problem?
>>> 
>>> 
>>> T. "Kuro" Kurosaka
>>> 
>>> 
>>
>



Re: Cluster Tuning

2011-07-07 Thread Ceriasmex
Are you the Esteban I know?



On 07/07/2011, at 15:53, Esteban Gutierrez  wrote:

> Hi Pony,
> 
> There is a good chance that your boxes are doing some heavy swapping and
> that is a killer for Hadoop.  Have you tried
> with mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes
> as much as possible?
> 
> Cheers,
> Esteban.
> 
> --
> Get Hadoop!  http://www.cloudera.com/downloads/
> 
> 
> 
> On Thu, Jul 7, 2011 at 1:29 PM, Juan P.  wrote:
> 
>> Hi guys!
>> 
>> I'd like some help fine tuning my cluster. I currently have 20 boxes
>> exactly
>> alike. Single core machines with 600MB of RAM. No chance of upgrading the
>> hardware.
>> 
>> My cluster is made out of 1 NameNode/JobTracker box and 19
>> DataNode/TaskTracker boxes.
>> 
>> All my config is default except I've set the following in my
>> mapred-site.xml
>> in an effort to try to prevent choking my boxes:
>>
>>   <property>
>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>     <value>1</value>
>>   </property>
>> 
>> I'm running a MapReduce job which reads a Proxy Server log file (2GB), maps
>> hosts to each record and then in the reduce task it accumulates the amount
>> of bytes received from each host.
>> 
>> Currently it's producing about 65000 keys
>> 
>> The whole job takes forever to complete, especially the reduce part. I've
>> tried different tuning configs but I can't bring it down under 20 mins.
>> 
>> Any ideas?
>> 
>> Thanks for your help!
>> Pony
>> 


Re: Hadoop Eclipse plugin doesn't show a dialog for New Hadoop Locations ... on MacOS

2011-07-07 Thread Teruhiko Kurosaka
Thanks, but selection of JRE 1.6 didn't help.


On 7/7/11 8:54 PM, "Pandu Pradhana"  wrote:

>Hi, 
>
>Maybe it's related to the Java version you are using. Try to use Java 1.6.
>
>Regards,
>--Pandu
>
>On Jul 7, 2011, at 4:03 PM, Teruhiko Kurosaka wrote:
>
>> Hi,
>> I'm new to Hadoop.  I'm trying to set up Eclipse for Hadoop debugging.
>> I have:
>> Eclipse 3.5.2, configured to run apps with JRE 1.5.
>> MacOS 10.6.8
>> Hadoop 0.21.0
>> 
>> I copied mapred/contrib/eclipse-plugin/hadoop-0.21.0-eclipse-plugin.jar
>> to /Applications/eclipse/plugins
>> and restarted Eclipse.
>> I switched to Map/Reduce perspective and I see Map/Reduce Locations tab
>> next to Problems, Tasks and Javadoc tabs.
>> I switched to the Map/Reduce Locations tab and right clicked within
>> the pane and choose "New Hadoop location..." but nothing happens.
>> A dialog window is supposed to pop up but nothing.
>> 
>> Is there any known issues? How do I trace this problem?
>> 
>> 
>> T. "Kuro" Kurosaka
>> 
>> 
>



Re: Cluster Tuning

2011-07-07 Thread Esteban Gutierrez
Hi Pony,

There is a good chance that your boxes are doing some heavy swapping and
that is a killer for Hadoop.  Have you tried
with mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes
as much as possible?
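
Concretely, those two settings would look something like this in
mapred-site.xml (the -Xmx figure is only an illustration for 600MB boxes;
tune it to whatever actually fits alongside the daemons):

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>   <!-- reuse the task JVM for an unlimited number of tasks -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>   <!-- example heap cap; pick what fits in 600MB -->
  </property>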

Cheers,
Esteban.

--
Get Hadoop!  http://www.cloudera.com/downloads/



On Thu, Jul 7, 2011 at 1:29 PM, Juan P.  wrote:

> Hi guys!
>
> I'd like some help fine tuning my cluster. I currently have 20 boxes
> exactly
> alike. Single core machines with 600MB of RAM. No chance of upgrading the
> hardware.
>
> My cluster is made out of 1 NameNode/JobTracker box and 19
> DataNode/TaskTracker boxes.
>
> All my config is default except I've set the following in my
> mapred-site.xml
> in an effort to try to prevent choking my boxes:
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
>
> I'm running a MapReduce job which reads a Proxy Server log file (2GB), maps
> hosts to each record and then in the reduce task it accumulates the amount
> of bytes received from each host.
>
> Currently it's producing about 65000 keys
>
> The whole job takes forever to complete, especially the reduce part. I've
> tried different tuning configs but I can't bring it down under 20 mins.
>
> Any ideas?
>
> Thanks for your help!
> Pony
>


Re: Cluster Tuning

2011-07-07 Thread Joey Echeverria
Have you tried using a Combiner?

Here's an example of using one:

http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Example%3A+WordCount+v1.0
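
For the bytes-per-host job described below, the reduce step is a plain sum, so
the same class can serve as both combiner and reducer and cut down what gets
shuffled. A rough sketch using the mapreduce API (class and field names are
made up for illustration, not taken from Pony's job):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the byte counts emitted for each host. Because addition is associative
// and commutative, the class is safe to use as a combiner as well as the reducer.
public class BytesPerHostReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  private final LongWritable total = new LongWritable();

  @Override
  protected void reduce(Text host, Iterable<LongWritable> counts, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable c : counts) {
      sum += c.get();
    }
    total.set(sum);
    context.write(host, total);
  }
}

Wiring it in is one extra line in the driver next to the existing reducer
setup: job.setCombinerClass(BytesPerHostReducer.class).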

-Joey

On Thu, Jul 7, 2011 at 4:29 PM, Juan P.  wrote:
> Hi guys!
>
> I'd like some help fine tuning my cluster. I currently have 20 boxes exactly
> alike. Single core machines with 600MB of RAM. No chance of upgrading the
> hardware.
>
> My cluster is made out of 1 NameNode/JobTracker box and 19
> DataNode/TaskTracker boxes.
>
> All my config is default except I've set the following in my mapred-site.xml
> in an effort to try to prevent choking my boxes:
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
>
> I'm running a MapReduce job which reads a Proxy Server log file (2GB), maps
> hosts to each record and then in the reduce task it accumulates the amount
> of bytes received from each host.
>
> Currently it's producing about 65000 keys
>
> The whole job takes forever to complete, especially the reduce part. I've
> tried different tuning configs but I can't bring it down under 20 mins.
>
> Any ideas?
>
> Thanks for your help!
> Pony
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Cluster Tuning

2011-07-07 Thread Juan P.
Hi guys!

I'd like some help fine tuning my cluster. I currently have 20 boxes exactly
alike. Single core machines with 600MB of RAM. No chance of upgrading the
hardware.

My cluster is made out of 1 NameNode/JobTracker box and 19
DataNode/TaskTracker boxes.

All my config is default except I've set the following in my mapred-site.xml
in an effort to try to prevent choking my boxes:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

I'm running a MapReduce job which reads a Proxy Server log file (2GB), maps
hosts to each record and then in the reduce task it accumulates the amount
of bytes received from each host.

Currently it's producing about 65000 keys

The whole job takes forever to complete, especially the reduce part. I've
tried different tuning configs but I can't bring it down under 20 mins.

Any ideas?

Thanks for your help!
Pony


Re: Re: measure the time taken to complete map and reduce phase

2011-07-07 Thread sangroya
Hi,

Thanks for the response!

I have the following queries regarding the Job History file.

I want to know what TOTAL_MAPS in the job history represents.

Also, whether FINISHED_MAPS represents TOTAL_MAPS or (TOTAL_MAPS -
FAILED_MAPS).

Does FINISHED_MAPS represent successfully executed maps?

I have the same question for REDUCE tasks.

Thanks,
Amit



On Thu, Jul 7, 2011 at 10:58 AM, Hailong [via Lucene]
 wrote:
> Hi sangroya,
>
> I think you may be interested in reading the following piece of code from
> JobHistory.java in Hadoop.
>
> /**
>  * Generates the job history filename for a new job
>  */
> private static String getNewJobHistoryFileName(JobConf jobConf, JobID
> id) {
>   return JOBTRACKER_UNIQUE_STRING
>  + id.toString() + "_" + getUserName(jobConf) + "_"
>  + trimJobName(getJobName(jobConf));
> }
>
> /**
>  * Trims the job-name if required
>  */
> private static String trimJobName(String jobName) {
>   if (jobName.length() > JOB_NAME_TRIM_LENGTH) {
> jobName = jobName.substring(0, JOB_NAME_TRIM_LENGTH);
>   }
>   return jobName;
> }
>
> Roughly speaking, the history file name is composed in the following way:
>
> hostname of JT + "_" + start time of JT + "_" + job id + "_" + user name +
> "_" + trimed job name
>
> Cheers!
>
> Hailong
>
> 2011-07-07
>
>
>
> ***
> * Hailong Yang, PhD. Candidate
> * Sino-German Joint Software Institute,
> * School of Computer Science&Engineering, Beihang University
> * Phone: (86-010)82315908
> * Email: [hidden email]
> * Address: G413, New Main Building in Beihang University,
> *  No.37 XueYuan Road,HaiDian District,
> *  Beijing,P.R.China,100191
> ***
>
>
>
> From: sangroya
> Sent: 2011-07-07 15:49:58
> To: hadoop-user
> Cc:
> Subject: Re: measure the time taken to complete map and reduce phase
>
> Hi,
> Thanks!
> I am able to parse the Job History Logs(JHL). But, I need to know how
> hadoop assigns a name to a file in the Job History Logs (JHL).
> I can see that files are named on my local single node cluster as this:
> localhost_1309975809398_job_201107062010_0759_sangroya_word+count.
> But, I am just wondering, what is the exact pattern to name every file
> like this.
> Best Regards,
> Amit
> On Tue, Jul 5, 2011 at 6:53 AM, Hailong [via Lucene]
> <[hidden email]> wrote:
>> Hi sangroya,
>>
>> You can look at the job administration portal at port 50030 on your
>> JobTracker, such as 'http://localhost:50030'. At the bottom of the
>> web page there is an item named 'Job Tracker History'; click into it and
>> find your job with the job id. There you will find the information you want.
>>
>>
>> Cheers!
>>
>> Hailong
>>
>> 2011-07-05
>>
>>
>>
>> ***
>> * Hailong Yang, PhD. Candidate
>> * Sino-German Joint Software Institute,
>> * School of Computer Science&Engineering, Beihang University
>> * Phone: (86-010)82315908
>> * Email: [hidden email]
>> * Address: G413, New Main Building in Beihang University,
>> *  No.37 XueYuan Road,HaiDian District,
>> *  Beijing,P.R.China,100191
>> ***
>>
>>
>>
>> From: sangroya
>> Sent: 2011-07-05 10:56:38
>> To: hadoop-user
>> Cc:
>> Subject: measure the time taken to complete map and reduce phase
>>
>> Hi,
>> I am trying to monitor the time to complete a map phase and reduce
>> phase in hadoop. Is there any way to measure the time taken to
>> complete map and reduce phase in a cluster.
>> Thanks,
>> Amit
>> --
>> View this message in context:
>>
>> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3136991.html
>> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>>
>>
>> 
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3139665.html
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3147426.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>
>
> 
> If you reply to this email, your message will be added to the discussion
> below:
> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3147566.html


--
View this message in context: 
http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136

RE: HTTP Error

2011-07-07 Thread Jeff.Schmitz
Adarsh,

You could also run from command line

[root@xxx bin]# ./hadoop dfsadmin -report
Configured Capacity: 1151948095488 (1.05 TB)
Present Capacity: 1059350446080 (986.6 GB)
DFS Remaining: 1056175992832 (983.64 GB)
DFS Used: 3174453248 (2.96 GB)
DFS Used%: 0.3%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-
Datanodes available: 5 (5 total, 0 dead)




-Original Message-
From: dhru...@gmail.com [mailto:dhru...@gmail.com] On Behalf Of Dhruv
Kumar
Sent: Thursday, July 07, 2011 10:01 AM
To: common-user@hadoop.apache.org
Subject: Re: HTTP Error

1) Check with jps to see if all services are functioning.

2) Have you tried appending dfshealth.jsp at the end of the URL as the
404
says?

Try using this:
http://localhost:50070/dfshealth.jsp



On Thu, Jul 7, 2011 at 7:13 AM, Adarsh Sharma
wrote:

> Dear all,
>
> Today I am stuck with a strange problem in the running hadoop cluster.
>
> After starting hadoop by bin/start-all.sh, all nodes are started. But
> when I check through the web UI (Master-IP:50070), it shows:
>
>
>   HTTP ERROR: 404
>
> /dfshealth.jsp
>
> RequestURI=/dfshealth.jsp
>
> Powered by Jetty://
>
> I checked from the command line that hadoop is not able to get out of safe
> mode.
>
> I know the manual command to leave safe mode:
>
> bin/hadoop dfsadmin -safemode leave
>
> But how can I make hadoop run properly, and what are the reasons for this
> error?
>
> Thanks
>
>
>



Re: HTTP Error

2011-07-07 Thread Dhruv Kumar
1) Check with jps to see if all services are functioning.

2) Have you tried appending dfshealth.jsp at the end of the URL as the 404
says?

Try using this:
http://localhost:50070/dfshealth.jsp
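
On a healthy single-node setup, jps should list all five daemons, something
like the sketch below (the PIDs are made up; the key thing is that NameNode
appears, otherwise nothing is listening behind the 50070 UI):

  $ jps
  2411 NameNode
  2532 DataNode
  2654 SecondaryNameNode
  2755 JobTracker
  2873 TaskTracker
  2990 Jps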



On Thu, Jul 7, 2011 at 7:13 AM, Adarsh Sharma wrote:

> Dear all,
>
> Today I am stuck with a strange problem in the running hadoop cluster.
>
> After starting hadoop by bin/start-all.sh, all nodes are started. But when
> I check through the web UI (Master-IP:50070), it shows:
>
>
>   HTTP ERROR: 404
>
> /dfshealth.jsp
>
> RequestURI=/dfshealth.jsp
>
> Powered by Jetty://
>
> I checked from the command line that hadoop is not able to get out of safe mode.
>
> I know the manual command to leave safe mode:
>
> bin/hadoop dfsadmin -safemode leave
>
> But how can I make hadoop run properly, and what are the reasons for this
> error?
>
> Thanks
>
>
>


Re: parallel cat

2011-07-07 Thread Rita
Thanks again Steve.

I will try to implement it with thrift.


On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran  wrote:

> On 07/07/11 08:22, Rita wrote:
>
>> Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
>> see any example code for the implementation.
>>
>>
> No. I think I have access to russ's source somewhere, but there'd be
> paperwork in getting it released. Russ said it wasn't too hard to do, he
> just had to patch the DFS client to offer up the entire list of block
> locations to the client, and let the client program make the decision. If
> you discussed this on the hdfs-dev list (via a JIRA), you may be able to get
> a patch for this accepted, though you have to do the code and tests
> yourself.
>
>
>> On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran  wrote:
>>
>>  On 06/07/11 11:08, Rita wrote:
>>>
>>>  I have many large files ranging from 2gb to 800gb and I use hadoop fs
 -cat
 a
 lot to pipe to various programs.

 I was wondering if its possible to prefetch the data for clients with
 more
 bandwidth. Most of my clients have 10g interface and datanodes are 1g.

 I was thinking, prefetch x blocks (even though it will cost extra
 memory)
 while reading block y. After block y is read, read the prefetched
 blocked
 and then throw it away.

 It should be used like this:


 export PREFETCH_BLOCKS=2 #default would be 1
 hadoop fs -pcat hdfs://namenode/verylarge file | program

 Any thoughts?


  Look at Russ Perry's work on doing very fast fetches from an HDFS
>>> filestore
>>> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>>> 
>>> >
>>>
>>>
>>> Here the DFS client got some extra data on where every copy of every
>>> block
>>> was, and the client decided which machine to fetch it from. This made the
>>> best use of the entire cluster, by keeping each datanode busy.
>>>
>>>
>>> -steve
>>>
>>>
>>
>>
>>
>


-- 
--- Get your facts first, then you can distort them as you please.--


Re: Difference between DFS Used and Non-DFS Used

2011-07-07 Thread Harsh J
DFS used is a count of all the space used by the dfs.data.dirs. The
non-dfs used space is whatever space is occupied beyond that (which
the DN does not account for).
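
A rough way to see the split on a datanode (the dfs.data.dir path below is
just an example; use whatever your config actually points at):

  # space the DataNode itself accounts for; this should roughly match "DFS Used"
  du -sh /var/lib/hadoop/dfs/data

  # usage of the whole partition; whatever is used beyond the dfs.data.dir
  # contents (logs, OS files, other applications) shows up as "Non DFS Used"
  df -h /var/lib/hadoop/dfs/data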

On Thu, Jul 7, 2011 at 3:29 PM, Sagar Shukla
 wrote:
> Hi,
>       What is the difference between DFS Used and Non-DFS used ?
>
> Thanks,
> Sagar
>
>
>



-- 
Harsh J


Re: Hadoop Eclipse plugin doesn't show a dialog for New Hadoop Locations ... on MacOS

2011-07-07 Thread Pandu Pradhana
Hi, 

Maybe it's related to the Java version you are using. Try to use Java 1.6.

Regards,
--Pandu

On Jul 7, 2011, at 4:03 PM, Teruhiko Kurosaka wrote:

> Hi,
> I'm new to Hadoop.  I'm trying to set up Eclipse for Hadoop debugging.
> I have:
> Eclipse 3.5.2, configured to run apps with JRE 1.5.
> MacOS 10.6.8
> Hadoop 0.21.0
> 
> I copied mapred/contrib/eclipse-plugin/hadoop-0.21.0-eclipse-plugin.jar
> to /Applications/eclipse/plugins
> and restarted Eclipse.
> I switched to Map/Reduce perspective and I see Map/Reduce Locations tab
> next to Problems, Tasks and Javadoc tabs.
> I switched to the Map/Reduce Locations tab and right clicked within
> the pane and choose "New Hadoop location..." but nothing happens.
> A dialog window is supposed to pop up but nothing.
> 
> Is there any known issues? How do I trace this problem?
> 
> 
> T. "Kuro" Kurosaka
> 
> 



HTTP Error

2011-07-07 Thread Adarsh Sharma

Dear all,

Today I am stuck with a strange problem in the running hadoop cluster.

After starting hadoop by bin/start-all.sh, all nodes are started. But
when I check through the web UI (Master-IP:50070), it shows:


   HTTP ERROR: 404

/dfshealth.jsp

RequestURI=/dfshealth.jsp

Powered by Jetty://

I checked from the command line that hadoop is not able to get out of safe mode.

I know the manual command to leave safe mode:

bin/hadoop dfsadmin -safemode leave

But how can I make hadoop run properly, and what are the reasons for this
error?

Thanks




Difference between DFS Used and Non-DFS Used

2011-07-07 Thread Sagar Shukla
Hi,
   What is the difference between DFS Used and Non-DFS used ?

Thanks,
Sagar

DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.



Re: parallel cat

2011-07-07 Thread Steve Loughran

On 07/07/11 08:22, Rita wrote:

Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
see any example code for the implementation.



No. I think I have access to russ's source somewhere, but there'd be 
paperwork in getting it released. Russ said it wasn't too hard to do, he 
just had to patch the DFS client to offer up the entire list of block 
locations to the client, and let the client program make the decision. 
If you discussed this on the hdfs-dev list (via a JIRA), you may be able 
to get a patch for this accepted, though you have to do the code and 
tests yourself.
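
In the meantime, the stock client API will at least hand you every block's
replica locations, so a client-side tool can spread its reads across datanodes
itself. A rough sketch using only the public FileSystem API (this is not
Russ's code, just an illustration):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints every block of a file together with the datanodes holding a replica,
// which is the information a "parallel cat" client would schedule reads from.
public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);   // e.g. hdfs://namenode/verylargefile
    FileSystem fs = file.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}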




On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran  wrote:


On 06/07/11 11:08, Rita wrote:


I have many large files ranging from 2gb to 800gb and I use hadoop fs -cat
a
lot to pipe to various programs.

I was wondering if its possible to prefetch the data for clients with more
bandwidth. Most of my clients have 10g interface and datanodes are 1g.

I was thinking, prefetch x blocks (even though it will cost extra memory)
while reading block y. After block y is read, read the prefetched blocked
and then throw it away.

It should be used like this:


export PREFETCH_BLOCKS=2 #default would be 1
hadoop fs -pcat hdfs://namenode/verylarge file | program

Any thoughts?



Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Here the DFS client got some extra data on where every copy of every block
was, and the client decided which machine to fetch it from. This made the
best use of the entire cluster, by keeping each datanode busy.


-steve









Hadoop Eclipse plugin doesn't show a dialog for New Hadoop Locations ... on MacOS

2011-07-07 Thread Teruhiko Kurosaka
Hi,
I'm new to Hadoop.  I'm trying to set up Eclipse for Hadoop debugging.
I have:
Eclipse 3.5.2, configured to run apps with JRE 1.5.
MacOS 10.6.8
Hadoop 0.21.0

I copied mapred/contrib/eclipse-plugin/hadoop-0.21.0-eclipse-plugin.jar
to /Applications/eclipse/plugins
and restarted Eclipse.
I switched to Map/Reduce perspective and I see Map/Reduce Locations tab
next to Problems, Tasks and Javadoc tabs.
I switched to the Map/Reduce Locations tab and right clicked within
the pane and choose "New Hadoop location..." but nothing happens.
A dialog window is supposed to pop up but nothing.

Is there any known issues? How do I trace this problem?


T. "Kuro" Kurosaka




Re: Re: measure the time taken to complete map and reduce phase

2011-07-07 Thread hailong.yang1115
Hi sangroya,

I think you may be interested in reading the following piece of code from 
JobHistory.java in Hadoop.

/**
 * Generates the job history filename for a new job
 */
private static String getNewJobHistoryFileName(JobConf jobConf, JobID id) {
  return JOBTRACKER_UNIQUE_STRING
         + id.toString() + "_" + getUserName(jobConf) + "_"
         + trimJobName(getJobName(jobConf));
}

/**
 * Trims the job-name if required
 */
private static String trimJobName(String jobName) {
  if (jobName.length() > JOB_NAME_TRIM_LENGTH) {
    jobName = jobName.substring(0, JOB_NAME_TRIM_LENGTH);
  }
  return jobName;
}

Roughly speaking, the history file name is composed in the following way:

hostname of JT + "_" + start time of JT + "_" + job id + "_" + user name + "_"
+ trimmed job name
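
For the example file name mentioned earlier in the thread,
localhost_1309975809398_job_201107062010_0759_sangroya_word+count, that breaks
down as hostname=localhost, JT start time=1309975809398,
job id=job_201107062010_0759, user=sangroya, job name=word+count. A small
sketch that splits such a name back into its parts (the class name is made up,
and it assumes the user name itself contains no underscore):

// Splits a 0.20-style job history file name of the form
//   <jtHostname>_<jtStartTime>_job_<jtStartId>_<seq>_<user>_<trimmedJobName>
// back into its parts.
public class HistoryFileNameParser {
  public static void main(String[] args) {
    String name = "localhost_1309975809398_job_201107062010_0759_sangroya_word+count";
    String[] p = name.split("_", 6);
    String jtHost = p[0];                                   // localhost
    String jtStartTime = p[1];                              // 1309975809398
    String jobId = p[2] + "_" + p[3] + "_" + p[4];          // job_201107062010_0759
    String user = p[5].substring(0, p[5].indexOf('_'));     // sangroya
    String jobName = p[5].substring(p[5].indexOf('_') + 1); // word+count
    System.out.println(jtHost + " | " + jtStartTime + " | " + jobId
        + " | " + user + " | " + jobName);
  }
}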

Cheers!

Hailong

2011-07-07 



***
* Hailong Yang, PhD. Candidate 
* Sino-German Joint Software Institute, 
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1...@gmail.com
* Address: G413, New Main Building in Beihang University, 
*  No.37 XueYuan Road,HaiDian District, 
*  Beijing,P.R.China,100191
***



From: sangroya
Sent: 2011-07-07 15:49:58
To: hadoop-user
Cc:
Subject: Re: measure the time taken to complete map and reduce phase
 
Hi,
Thanks!
I am able to parse the Job History Logs(JHL). But, I need to know how
hadoop assigns a name to a file in the Job History Logs (JHL).
I can see that files are named on my local single node cluster as this:
localhost_1309975809398_job_201107062010_0759_sangroya_word+count.
But, I am just wondering, what is the exact pattern to name every file
like this.
Best Regards,
Amit
On Tue, Jul 5, 2011 at 6:53 AM, Hailong [via Lucene]
 wrote:
> Hi sangroya,
>
> You can look at the job administration portal at port 50030 on your
> JobTracker, such as 'http://localhost:50030'. At the bottom of the
> web page there is an item named 'Job Tracker History'; click into it and
> find your job with the job id. There you will find the information you want.
>
>
> Cheers!
>
> Hailong
>
> 2011-07-05
>
>
>
> ***
> * Hailong Yang, PhD. Candidate
> * Sino-German Joint Software Institute,
> * School of Computer Science&Engineering, Beihang University
> * Phone: (86-010)82315908
> * Email: [hidden email]
> * Address: G413, New Main Building in Beihang University,
> *  No.37 XueYuan Road,HaiDian District,
> *  Beijing,P.R.China,100191
> ***
>
>
>
> From: sangroya
> Sent: 2011-07-05 10:56:38
> To: hadoop-user
> Cc:
> Subject: measure the time taken to complete map and reduce phase
>
> Hi,
> I am trying to monitor the time to complete a map phase and reduce
> phase in hadoop. Is there any way to measure the time taken to
> complete map and reduce phase in a cluster.
> Thanks,
> Amit
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3136991.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>
>
> 
> If you reply to this email, your message will be added to the discussion
> below:
> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3139665.html
--
View this message in context: 
http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3147426.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: thrift and python

2011-07-07 Thread Rita
Could someone please compile and provide the jar for this class? It would be
much appreciated. I am running
r0.21.0.



On Thu, Jul 7, 2011 at 3:56 AM, Rita  wrote:

> By looking at this:
> http://www.mail-archive.com/mapreduce-dev@hadoop.apache.org/msg02088.html
>
> Is it still necessary to compile the jar to resolve,
>
> Could not find the main class:
>
> org.apache.hadoop.thriftfs.HadoopThriftServer. Program will exit.
>
> I would think the .jar would exist on the latest version of hadoop/hdfs
>
>
> --
> --- Get your facts first, then you can distort them as you please.--
>



-- 
--- Get your facts first, then you can distort them as you please.--


thrift and python

2011-07-07 Thread Rita
By looking at this:
http://www.mail-archive.com/mapreduce-dev@hadoop.apache.org/msg02088.html

Is it still necessary to compile the jar to resolve,

Could not find the main class:

org.apache.hadoop.thriftfs.HadoopThriftServer. Program will exit.

I would think the .jar would exist on the latest version of hadoop/hdfs


-- 
--- Get your facts first, then you can distort them as you please.--


Re: measure the time taken to complete map and reduce phase

2011-07-07 Thread sangroya
Hi,

Thanks!

I am able to parse the Job History Logs(JHL). But, I need to know how
hadoop assigns a name to a file in the Job History Logs (JHL).

I can see that files are named on my local single node cluster as this:

localhost_1309975809398_job_201107062010_0759_sangroya_word+count.

But, I am just wondering, what is the exact pattern to name every file
like this.

Best Regards,
Amit

On Tue, Jul 5, 2011 at 6:53 AM, Hailong [via Lucene]
 wrote:
> Hi sangroya,
>
> You can look at the job administration portal at port 50030 on your
> JobTracker, such as 'http://localhost:50030'. At the bottom of the
> web page there is an item named 'Job Tracker History'; click into it and
> find your job with the job id. There you will find the information you want.
>
>
> Cheers!
>
> Hailong
>
> 2011-07-05
>
>
>
> ***
> * Hailong Yang, PhD. Candidate
> * Sino-German Joint Software Institute,
> * School of Computer Science&Engineering, Beihang University
> * Phone: (86-010)82315908
> * Email: [hidden email]
> * Address: G413, New Main Building in Beihang University,
> *  No.37 XueYuan Road,HaiDian District,
> *  Beijing,P.R.China,100191
> ***
>
>
>
> From: sangroya
> Sent: 2011-07-05 10:56:38
> To: hadoop-user
> Cc:
> Subject: measure the time taken to complete map and reduce phase
>
> Hi,
> I am trying to monitor the time to complete a map phase and reduce
> phase in hadoop. Is there any way to measure the time taken to
> complete map and reduce phase in a cluster.
> Thanks,
> Amit
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3136991.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>
>
> 
> If you reply to this email, your message will be added to the discussion
> below:
> http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3139665.html


--
View this message in context: 
http://lucene.472066.n3.nabble.com/measure-the-time-taken-to-complete-map-and-reduce-phase-tp3136991p3147426.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: parallel cat

2011-07-07 Thread Rita
Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't
see any example code for the implementation.



On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran  wrote:

> On 06/07/11 11:08, Rita wrote:
>
>> I have many large files ranging from 2gb to 800gb and I use hadoop fs -cat
>> a
>> lot to pipe to various programs.
>>
>> I was wondering if its possible to prefetch the data for clients with more
>> bandwidth. Most of my clients have 10g interface and datanodes are 1g.
>>
>> I was thinking, prefetch x blocks (even though it will cost extra memory)
>> while reading block y. After block y is read, read the prefetched blocked
>> and then throw it away.
>>
>> It should be used like this:
>>
>>
>> export PREFETCH_BLOCKS=2 #default would be 1
>> hadoop fs -pcat hdfs://namenode/verylarge file | program
>>
>> Any thoughts?
>>
>>
> Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>
> Here the DFS client got some extra data on where every copy of every block
> was, and the client decided which machine to fetch it from. This made the
> best use of the entire cluster, by keeping each datanode busy.
>
>
> -steve
>



-- 
--- Get your facts first, then you can distort them as you please.--