Re: mapr common library?
Thanks guys, and sorry for not being more specific, but yes, Cloud9 and Mahout are definitely what I'm looking for; much appreciated.

On Wed, Oct 19, 2011 at 9:23 PM, Harsh J <ha...@cloudera.com> wrote:
> Alex,
>
> I know of Cloud9 (http://lintool.github.com/Cloud9/index.html) as a
> library that caters to Hadoop MapReduce specifically, but I am sure
> there are others; I'm just not sure of any very prominent ones. Apache
> Mahout carries MR code in it as well, and you might be interested in
> checking it out at http://mahout.apache.org.
>
> On Thu, Oct 20, 2011 at 8:36 AM, Alex Gauthier
> <alexgauthie...@gmail.com> wrote:
>> Is there such a thing somewhere? I have the basic nPath and
>> Lucene-like search processing, but I'm looking for ETL-like
>> transformations, a typical weblog processor, or clickstream analysis.
>> Anything beyond wordcount would be appreciated :) Godspeed.
>>
>> Alex
>> http://twitter.com/#!/A23Corp
>
> --
> Harsh J
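Since "anything beyond wordcount" comes up a lot: below is a minimal sketch of a weblog-style job of the kind Alex describes, parsing Apache common-log request lines and counting hits per URL. It is illustrative only, not taken from Cloud9 or Mahout; the class names, the regex, and the assumption of common-log-format input are all mine.

    // Minimal sketch (new-style API, Hadoop 0.20.x). Counts hits per URL
    // from weblog lines; assumes Apache common log format input.
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WeblogHitCount {
      public static class ParseMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Capture the request path out of, e.g., "GET /index.html HTTP/1.1"
        private static final Pattern REQ = Pattern.compile("\"[A-Z]+ (\\S+) HTTP");
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          Matcher m = REQ.matcher(value.toString());
          if (m.find()) {
            url.set(m.group(1));
            context.write(url, ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "weblog-hit-count");
        job.setJarByClass(WeblogHitCount.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);  // same key/value types in and out
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

From here, ETL-style transformations are mostly a matter of swapping out the parse-and-emit logic in the mapper.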
Re: execute hadoop job from remote web application
On 18/10/11 17:56, Harsh J wrote:
> Oleg,
>
> It will pack up the jar that contains the class specified by
> setJarByClass into its submission jar and send it up. That's the
> function of that particular API method. So, your deduction is almost
> right there :)
>
> On Tue, Oct 18, 2011 at 10:20 PM, Oleg Ruchovets
> <oruchov...@gmail.com> wrote:
>> So you mean that in case I am going to submit a job remotely, and
>> my_hadoop_job.jar is on the classpath of my web application, it will
>> submit the job with my_hadoop_job.jar to the remote Hadoop machine
>> (cluster)?

There's also the problem of waiting for your work to finish. If you want to see something complicated that does everything but JAR upload, I have some code here that listens for events coming out of the job and so builds up a history of what is happening. It also does better preflight checking of source and dest data directories:
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/mapreduce/submitter/SubmitterImpl.java
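To make the mechanics concrete, here is a minimal sketch of remote submission from client (e.g. web application) code. The host names, ports, paths, and the MyMapper/MyReducer classes are placeholder assumptions; the point is that setJarByClass() is what tells Hadoop which client-side jar to ship.

    // Minimal sketch: submit a job to a remote cluster from client code.
    // Addresses, paths, and MyMapper/MyReducer are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RemoteSubmit {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        Job job = new Job(conf, "remote-submit-example");
        // Hadoop finds the jar on the client classpath that contains this
        // class and ships it to the cluster as the job jar.
        job.setJarByClass(MyMapper.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        job.submit();  // asynchronous: returns once the job is accepted
        // ...or block until it finishes, which is the "waiting for your
        // work to finish" problem mentioned above:
        // boolean ok = job.waitForCompletion(true);
      }
    }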
Re: Is there a good way to see how full hdfs is
Hi,

I have the same question regarding the documentation: is there something like this for memory and CPU utilization also?

Sent from my iPhone

Thanks,
JJ

On Oct 19, 2011, at 5:00 PM, Rajiv Chittajallu <raj...@yahoo-inc.com> wrote:
> ivan.nov...@emc.com wrote on 10/18/11 at 09:23:50 -0700:
>> Cool, is there any documentation on how to use the JMX stuff to get
>> monitoring data?
>>
>> Cheers,
>> Ivan
>
> I don't know if there is any specific documentation. These are the
> mbeans you might be interested in:
>
> NameNode:
>   Hadoop:service=NameNode,name=FSNamesystemState
>   Hadoop:service=NameNode,name=NameNodeInfo
>   Hadoop:service=NameNode,name=jvm
>
> JobTracker:
>   Hadoop:service=JobTracker,name=JobTrackerInfo
>   Hadoop:service=JobTracker,name=QueueMetrics,q=<queuename>
>   Hadoop:service=JobTracker,name=jvm
>
> DataNode:
>   Hadoop:name=DataNodeInfo,service=DataNode
>
> TaskTracker:
>   Hadoop:service=TaskTracker,name=TaskTrackerInfo
>
> You may also want to monitor shuffle_exceptions_caught in
> Hadoop:service=TaskTracker,name=ShuffleServerMetrics

On 10/17/11 6:04 PM, Rajiv Chittajallu <raj...@yahoo-inc.com> wrote:
> If you are running 0.20.204:
> http://phanpy-nn1.hadoop.apache.org:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo

ivan.nov...@emc.com wrote on 10/17/11 at 09:18:20 -0700:
> Hi Harsh,
>
> I need access to the data programmatically for system automation, and
> hence I do not want a monitoring tool but access to the raw data. I am
> more than happy to use an exposed function or client program and not an
> internal API. So I am still a bit confused... What is the simplest way
> to get at this raw disk usage data programmatically? Is there an HDFS
> equivalent of du and df, or are you suggesting to just run those on the
> Linux OS (which is perfectly doable)?
>
> Cheers,
> Ivan

On 10/17/11 9:05 AM, Harsh J <ha...@cloudera.com> wrote:
> Uma/Ivan,
>
> The DistributedFileSystem class explicitly is _not_ meant for public
> consumption; it is an internal one. Additionally, that method has been
> deprecated. What you need is FileSystem#getStatus() if you want the
> summarized report via code.
>
> A job that possibly runs du or df is a good idea if you can guarantee
> perfect homogeneity of path names in your cluster. But I wonder: why
> won't using a general monitoring tool (such as Nagios) for this purpose
> cut it? What's the end goal here?
>
> P.S. I'd moved this conversation to hdfs-user@ earlier on, but now I
> see it being cross-posted into mr-user, common-user, and common-dev --
> why?

On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686 <mahesw...@huawei.com> wrote:
> We can write a simple program and you can call this API; make sure the
> Hadoop jars are present in your classpath. Just for more clarification:
> DNs send their stats as part of heartbeats, so the NN maintains all the
> statistics about disk-space usage for the complete filesystem, etc.
> This API will give you those stats.
>
> Regards,
> Uma

----- Original Message -----
From: ivan.nov...@emc.com
Date: Monday, October 17, 2011 9:07 pm
Subject: Re: Is there a good way to see how full hdfs is
To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
Cc: common-...@hadoop.apache.org

> So is there a client program to call this? Can one write their own
> simple client to call this method from all disks on the cluster? How
> about a map-reduce job to collect from all disks on the cluster?
On 10/15/11 4:51 AM, Uma Maheswara Rao G 72686 <mahesw...@huawei.com> wrote:
>   /** Return the disk usage of the filesystem, including total capacity,
>    *  used space, and remaining space */
>   public DiskStatus getDiskStatus() throws IOException {
>     return dfs.getDiskStatus();
>   }
>
> DistributedFileSystem has the above API on the Java API side.
>
> Regards,
> Uma

----- Original Message -----
From: wd <w...@wdicc.com>
Date: Saturday, October 15, 2011 4:16 pm
Subject: Re: Is there a good way to see how full hdfs is
To: mapreduce-u...@hadoop.apache.org

> hadoop dfsadmin -report

On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> We have a small cluster with HDFS running on only 8 nodes. I believe
> that the partition assigned to HDFS might be getting full, and I wonder
> if the web tools or Java API have a way to look at free space on HDFS.
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com

--
Harsh J
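Pulling the thread's answers together: a minimal sketch of the public-API route Harsh recommends (FileSystem#getStatus(), rather than the internal, deprecated DistributedFileSystem#getDiskStatus() quoted above). The NameNode URI is a placeholder assumption.

    // Minimal sketch: summarized HDFS capacity via the public FileSystem API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class HdfsUsage {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder; normally picked up from core-site.xml on the classpath.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);
        FsStatus status = fs.getStatus();  // totals as reported by the NN
        System.out.println("capacity  = " + status.getCapacity());
        System.out.println("used      = " + status.getUsed());
        System.out.println("remaining = " + status.getRemaining());
        fs.close();
      }
    }

The same numbers are also exposed over HTTP by the NameNode's /jmx servlet shown earlier in the thread, which is handy for automation that would rather poll JSON than link against the Hadoop jars.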
Fixing Mis-replicated blocks
After a hardware move with an unfortunately mis-set-up rack-awareness script, our Hadoop cluster has a large number of mis-replicated blocks. After about a week, things haven't gotten better on their own. Is there a good way to trigger the name node to fix the mis-replicated blocks?

Here's what I'm using for now, but it is very slow:

    for f in `hadoop fsck / | grep "Replica placement policy is violated" \
              | head -n3000 | awk -F: '{print $1}'`; do
      hadoop fs -setrep 4 $f
      hadoop fs -setrep 3 $f
    done

John
Capacity Scheduler : how to use more than the queue capacity ?
Hi,

By choosing the capacity scheduler, I was under the impression that each queue could borrow other queues' resources if they are available. Let's say we have the configuration below, and a total capacity of 180 slots. What I expect is that whenever the default and cpu-bound queues have no job, then jobs submitted to io-bound should be able to borrow up to 90 slots (50% of total capacity). However, it looks like it never gets above 59 slots (33% of 180 slots). Is there something I missed?

Thanks,
Sami Dalouche

---
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>33</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.supports-priority</name>
  <value>true</value>
</property>

<!-- queue: io-bound -->
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.capacity</name>
  <value>33</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.supports-priority</name>
  <value>true</value>
</property>

<!-- queue: cpu-bound -->
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.capacity</name>
  <value>34</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.supports-priority</name>
  <value>true</value>
</property>
Re: Capacity Scheduler : how to use more than the queue capacity ?
Hi,

I ended up finding another post about this exact same issue on this exact same mailing list, just a few days old... It looks like the setting to play with is mapred.capacity-scheduler.default-user-limit-factor.

Sami

On Thu, Oct 20, 2011 at 1:25 PM, Sami Dalouche <sa...@hopper.com> wrote:
> Hi,
>
> By choosing the capacity scheduler, I was under the impression that
> each queue could borrow other queues' resources if they are available.
> Let's say we have the configuration below, and a total capacity of 180
> slots. What I expect is that whenever the default and cpu-bound queues
> have no job, then jobs submitted to io-bound should be able to borrow
> up to 90 slots (50% of total capacity). However, it looks like it never
> gets above 59 slots (33% of 180 slots). Is there something I missed?
>
> Thanks,
> Sami Dalouche
>
> [queue configuration snipped; see the original message above]
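For the archive, a note on why this is likely the fix: if I read the capacity scheduler docs right, the user-limit factor defaults to 1, which caps a single user at the queue's configured capacity even when the rest of the cluster is idle, so maximum-capacity alone never kicks in for a single user. A sketch of the setting; the value 2 is illustrative, not a recommendation, and I believe there is also a per-queue variant of the form mapred.capacity-scheduler.queue.<queue-name>.user-limit-factor:

    <property>
      <name>mapred.capacity-scheduler.default-user-limit-factor</name>
      <!-- let a single user exceed a queue's capacity by up to 2x,
           still bounded by that queue's maximum-capacity -->
      <value>2</value>
    </property>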
Re: Hadoop archive
Could you try 0.20.205.0? The HAR issue in branch-20-security was fixed by JIRA HADOOP-7539.

-----Original Message-----
From: Jonas Hartwig <jonas.hart...@cision.com>
Reply-To: common-user@hadoop.apache.org
Date: Mon, 17 Oct 2011 02:11:24 -0700
To: common-user@hadoop.apache.org
Subject: Hadoop archive

> Hi, I'm new to the community. I'd like to create an archive, but I get
> the error "Exception in archives null". I'm using Hadoop 0.20.204.0.
> The issue was tracked under MAPREDUCE-1399
> (https://issues.apache.org/jira/browse/MAPREDUCE-1399) and solved. How
> do I combine my Hadoop version with a new map/reduce release? And how
> do I get the release using Firefox? I saw something like JIRA, but the
> Firefox plugin is not working with 7.x.
>
> regards
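For anyone hitting the same error, a usage sketch: on 0.20.205.0 (with the HADOOP-7539 changes) I believe archive creation looks like the following, where the archive name and paths are placeholders:

    hadoop archive -archiveName logs.har -p /user/jonas/logs 2011-10 /user/jonas/archives

Note that this runs a MapReduce job under the hood, so the cluster must be reachable and the JobTracker running when you invoke it.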
Connecting to vm through java
Hi guys, I'm getting the dreaded org.apache.hadoop.ipc.Client$Connection handleConnectionFailure when connecting to Cloudera's Hadoop (running in a VM) to request running a simple M/R job (from a machine outside the Hadoop VM).

I've seen a lot of posts about this online, and it's also on Stack Overflow here:
http://stackoverflow.com/questions/6997327/connecting-to-cloudera-vm-from-my-desktop

Any tips on debugging Java's connection to HDFS over the network? It's not entirely clear to me how the connection is made/authenticated between the client and Hadoop. For example, is a passwordless SSH file required? I believe this error is related to authentication but am not sure of the best way to test it... I have confirmed that the IP is valid, and it appears that HDFS is running and served over the right default port in the VM.

Sent from my iPad
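One thing that may help narrow it down: the Hadoop IPC client talks directly to the NameNode's RPC port, and passwordless SSH is not involved at all; SSH is only used by the cluster's start/stop scripts. A minimal connectivity probe, where the host and port are placeholder assumptions for the VM:

    // Minimal sketch: probe HDFS connectivity from outside the VM.
    // The URI is a placeholder; it must match fs.default.name in the
    // VM's core-site.xml, and the NameNode must not be bound to
    // localhost inside the guest or external clients cannot reach it.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPing {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://192.168.56.101:8020");
        FileSystem fs = FileSystem.get(conf);
        // If this listing succeeds, basic IPC connectivity (and simple
        // authentication) is working.
        for (FileStatus st : fs.listStatus(new Path("/"))) {
          System.out.println(st.getPath());
        }
        fs.close();
      }
    }

A quick first check before any Java: telnet (or nc) from your desktop to the VM's IP on the NameNode port. If that fails, the problem is networking (VM NAT vs. bridged mode, a firewall, or a localhost binding), not authentication.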
running sqoop on hadoop cluster
Hi All,

I'm a newbie with Hadoop. If I install Hadoop on 2 nodes, where does HDFS run: on the master node or the slave node? And if I run Sqoop to export from a DBMS to Hive, will it speed things up on a multi-node Hadoop cluster compared to a single-node one? Please give me an explanation. Thanks.

--
View this message in context: http://old.nabble.com/running-sqoop-on-hadoop-cluster-tp32693398p32693398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Fixing Mis-replicated blocks
Do setrep -w on the increase to force the new replica before decreasing again. Of course, the little script only works if the replication factor is 3 on all the files. If it's a variable amount, you should use the Java API to get the existing factor, then increase by one and then decrease.

Jeff

On Thu, Oct 20, 2011 at 8:44 AM, John Meagher <john.meag...@gmail.com> wrote:
> After a hardware move with an unfortunately mis-set-up rack-awareness
> script, our Hadoop cluster has a large number of mis-replicated blocks.
> After about a week, things haven't gotten better on their own. Is there
> a good way to trigger the name node to fix the mis-replicated blocks?
>
> Here's what I'm using for now, but it is very slow:
>
>     for f in `hadoop fsck / | grep "Replica placement policy is violated" \
>               | head -n3000 | awk -F: '{print $1}'`; do
>       hadoop fs -setrep 4 $f
>       hadoop fs -setrep 3 $f
>     done
>
> John
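As a sketch of the variable-replication variant Jeff describes, using the public FileSystem API. The starting path is an assumption, and note the caveat in the comments: like the shell loop without -w, this does not wait for the extra replica before dropping back down, so a real tool should poll in between.

    // Minimal sketch: bump each file's replication by one and restore
    // it, preserving per-file replication factors (Jeff's suggestion).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReReplicate {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        rewrite(fs, new Path(args.length > 0 ? args[0] : "/"));
        fs.close();
      }

      static void rewrite(FileSystem fs, Path p) throws Exception {
        for (FileStatus st : fs.listStatus(p)) {
          if (st.isDir()) {
            rewrite(fs, st.getPath());   // recurse into subdirectories
          } else {
            short rep = st.getReplication();  // existing per-file factor
            fs.setReplication(st.getPath(), (short) (rep + 1));
            // CAVEAT: unlike `hadoop fs -setrep -w`, setReplication()
            // returns before the new replica exists; poll here before
            // restoring the original factor.
            fs.setReplication(st.getPath(), rep);
          }
        }
      }
    }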