Re: Fixing Mis-replicated blocks

2011-10-20 Thread Jeff Bean
Use setrep -w on the increase so it waits for the new replicas before you
decrease again.

Of course, the little script only works if the replication factor is 3 on
all the files. If it varies, you should use the Java API to get each file's
existing factor, increase it by one, and then decrease it back.
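Something along these lines is what I had in mind -- just a rough, untested
sketch. It assumes you pass in the paths that fsck reports and that the client
configuration points at your cluster; unlike setrep -w it does not wait for the
extra replicas to appear, so add a pause or a check if you need that:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// For each path given on the command line: read its current replication
// factor, bump it by one to force new (correctly placed) replicas, then
// set it back to the original value.
public class ReReplicate {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (String arg : args) {
      Path path = new Path(arg);
      short rep = fs.getFileStatus(path).getReplication();
      fs.setReplication(path, (short) (rep + 1));
      fs.setReplication(path, rep);
    }
  }
}

Compile it against the Hadoop jars and run it with "hadoop jar" (or put the
class on HADOOP_CLASSPATH and run "hadoop ReReplicate <paths>").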

Jeff

On Thu, Oct 20, 2011 at 8:44 AM, John Meagher wrote:

> After a hardware move with an unfortunately misconfigured rack awareness
> script, our Hadoop cluster has a large number of mis-replicated blocks.
> After about a week, things haven't gotten better on their own.
>
> Is there a good way to trigger the name node to fix the mis-replicated
> blocks?
>
> Here's what I'm using for now, but it is very slow:
> for f in `hadoop fsck / | grep "Replica placement policy is violated"
> | head -n3000 | awk -F: '{print $1}'`; do
>    hadoop fs -setrep 4 $f
>    hadoop fs -setrep 3 $f
> done
>
> John
>


running sqoop on hadoop cluster

2011-10-20 Thread firantika

Hi All,
I'm a newbie to Hadoop.

If I install Hadoop on 2 nodes, where does HDFS run? On the master or the
slave node?

And if I run Sqoop to export from a DBMS to Hive, will it be faster on a
multi-node Hadoop cluster than on a single-node one?

Please explain.


Tks


-- 
View this message in context: 
http://old.nabble.com/running-sqoop-on-hadoop-cluster-tp32693398p32693398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Connecting to vm through java

2011-10-20 Thread JAX
 Hi guys: I'm getting the dreaded

org.apache.hadoop.ipc.Client$Connection handleConnectionFailure

when connecting to Cloudera's Hadoop (running in a VM) to request running a
simple M/R job (from a machine outside the Hadoop VM).

I've seen a lot of posts about this online, and it's also on Stack Overflow
here:
http://stackoverflow.com/questions/6997327/connecting-to-cloudera-vm-from-my-desktop

Any tips on debugging Java's connection to HDFS over the network?

It's not entirely clear to me how the connection is made/authenticated between
the client and Hadoop. For example, is a passwordless SSH file required? I
believe this error is related to authentication but am not sure of the best
way to test it... I have confirmed that the IP is valid, and it appears that
HDFS is being run and served over the right default port in the VM.
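For reference, the kind of minimal client I'm trying to run from outside the
VM is roughly this (the IP and port are just placeholders for whatever the VM
actually exposes; as far as I can tell the client only talks to the NameNode's
IPC port named in fs.default.name, not over SSH):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal connectivity check: point fs.default.name at the VM's NameNode
// and list the root directory.
public class HdfsPing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://192.168.56.101:8020"); // placeholder VM address
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus stat : fs.listStatus(new Path("/"))) {
      System.out.println(stat.getPath());
    }
  }
}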




Sent from my iPad

Re: Hadoop archive

2011-10-20 Thread John George
Could you try 0.20.205.0? The HAR issue in branch-20-security was addressed
under HADOOP-7539.


-Original Message-
From: Jonas Hartwig 
Reply-To: "common-user@hadoop.apache.org" 
Date: Mon, 17 Oct 2011 02:11:24 -0700
To: "common-user@hadoop.apache.org" 
Subject: Hadoop archive

>Hi, I'm new to the community.
>
>I'd like to create an archive but I get the error: "Exception in archives
>null".
>
>I'm using Hadoop 0.20.204.0. The issue was tracked under MAPREDUCE-1399
>and solved. How do I combine my Hadoop version with a new map/reduce
>release? And how do I get the release using Firefox? I saw something like
>JIRA but the Firefox plugin is not working with 7.x.
>
> 
>
>regards
>



Re: Capacity Scheduler : how to use more than the queue capacity ?

2011-10-20 Thread Sami Dalouche
Hi,

I ended up finding another post about the exact same issue on this same
mailing list, from just a few days ago...

It looks like the setting to play with is
mapred.capacity-scheduler.default-user-limit-factor.

Sami

On Thu, Oct 20, 2011 at 1:25 PM, Sami Dalouche  wrote:

> Hi,
>
> By choosing the capacity scheduler, I was under the impression that each
> queue could borrow other queues' resources if they are available.
>
>
> Let's say we have the configuration below, and a total capacity of 180
> slots.
> What I expect is that whenever default and cpu-bound queues have no job,
> then jobs submitted to io-bound should be able to borrow up to 90 slots (50%
> total capacity).
> However, it looks like it never gets above 59 slots (33% of 180 slots).
>
> Is there something I missed?
> Thanks,
> Sami Dalouche
>
> ---
> <property>
>   <name>mapred.capacity-scheduler.queue.default.capacity</name>
>   <value>33</value>
> </property>
> <property>
>   <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
>   <value>50</value>
> </property>
> <property>
>   <name>mapred.capacity-scheduler.queue.default.supports-priority</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>mapred.capacity-scheduler.queue.io-bound.capacity</name>
>   <value>33</value>
> </property>
> <property>
>   <name>mapred.capacity-scheduler.queue.io-bound.maximum-capacity</name>
>   <value>50</value>
> </property>
> <property>
>   <name>mapred.capacity-scheduler.queue.io-bound.supports-priority</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>mapred.capacity-scheduler.queue.cpu-bound.capacity</name>
>   <value>34</value>
> </property>
> <property>
>   <name>mapred.capacity-scheduler.queue.cpu-bound.maximum-capacity</name>
>   <value>100</value>
> </property>
> <property>
>   <name>mapred.capacity-scheduler.queue.cpu-bound.supports-priority</name>
>   <value>true</value>
> </property>
>


Capacity Scheduler : how to use more than the queue capacity ?

2011-10-20 Thread Sami Dalouche
Hi,

By choosing the capacity scheduler, I was under the impression that each
queue could borrow other queues' resources if they are available.


Let's say we have the configuration below, and a total capacity of 180
slots.
What I expect is that whenever default and cpu-bound queues have no job,
then jobs submitted to io-bound should be able to borrow up to 90 slots (50%
total capacity).
However, it looks like it never gets above 59 slots (33% of 180 slots).

Is there something I missed?
Thanks,
Sami Dalouche

---
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>33</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.supports-priority</name>
  <value>true</value>
</property>

<property>
  <name>mapred.capacity-scheduler.queue.io-bound.capacity</name>
  <value>33</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.supports-priority</name>
  <value>true</value>
</property>

<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.capacity</name>
  <value>34</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.supports-priority</name>
  <value>true</value>
</property>

Fixing Mis-replicated blocks

2011-10-20 Thread John Meagher
After a hardware move with an unfortunately misconfigured rack awareness
script, our Hadoop cluster has a large number of mis-replicated blocks.
After about a week, things haven't gotten better on their own.

Is there a good way to trigger the name node to fix the mis-replicated blocks?

Here's what I'm using for now, but it is very slow:
for f in `hadoop fsck / | grep "Replica placement policy is violated"
| head -n3000 | awk -F: '{print $1}'`; do
hadoop fs -setrep 4 $f
hadoop fs -setrep 3 $f
done

John


Re: Is there a good way to see how full hdfs is

2011-10-20 Thread Mapred Learn
Hi,
I have the same question regarding the documentation, and also:
is there something like this for memory and CPU utilization?
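For the disk side, I assume Harsh's FileSystem#getStatus() suggestion below
boils down to something like this sketch (untested, and only if your Hadoop
version already has getStatus(); the client configuration is assumed to point
at the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

// Print overall HDFS capacity, used and remaining bytes -- the same totals
// that "hadoop dfsadmin -report" summarizes.
public class HdfsUsage {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FsStatus status = fs.getStatus();
    System.out.println("capacity  = " + status.getCapacity());
    System.out.println("used      = " + status.getUsed());
    System.out.println("remaining = " + status.getRemaining());
  }
}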

Sent from my iPhone

Thanks,
JJ

On Oct 19, 2011, at 5:00 PM, Rajiv Chittajallu  wrote:

> ivan.nov...@emc.com wrote on 10/18/11 at 09:23:50 -0700:
>> Cool, is there any documentation on how to use the JMX stuff to get
>> monitoring data?
> 
> I don't know if there is any specific documentation. These are the
> mbeans you might be interested in
> 
> Namenode:
> 
> Hadoop:service=NameNode,name=FSNamesystemState
> Hadoop:service=NameNode,name=NameNodeInfo
> Hadoop:service=NameNode,name=jvm
> 
> JobTracker:
> 
> Hadoop:service=JobTracker,name=JobTrackerInfo
> Hadoop:service=JobTracker,name=QueueMetrics,q=
> Hadoop:service=JobTracker,name=jvm
> 
> DataNode:
> Hadoop:name=DataNodeInfo,service=DataNode
> 
> TaskTracker:
> Hadoop:service=TaskTracker,name=TaskTrackerInfo
> 
> You may also want to monitor shuffle_exceptions_caught in 
> Hadoop:service=TaskTracker,name=ShuffleServerMetrics 
> 
>> 
>> Cheers,
>> Ivan
>> 
>> On 10/17/11 6:04 PM, "Rajiv Chittajallu"  wrote:
>> 
>>> If you are running > 0.20.204
>>> http://phanpy-nn1.hadoop.apache.org:50070/jmx?qry=Hadoop:service=NameNode,
>>> name=NameNodeInfo
>>> 
>>> 
>>> ivan.nov...@emc.com wrote on 10/17/11 at 09:18:20 -0700:
 Hi Harsh,
 
 I need access to the data programmatically for system automation, and
 hence I do not want a monitoring tool but access to the raw data.
 
 I am more than happy to use an exposed function or client program and not
 an internal API.
 
 So I am still a bit confused... What is the simplest way to get at this
 raw disk usage data programmatically? Is there an HDFS equivalent of du
 and df, or are you suggesting to just run that on the Linux OS (which is
 perfectly doable).
 
 Cheers,
 Ivan
 
 
 On 10/17/11 9:05 AM, "Harsh J"  wrote:
 
> Uma/Ivan,
> 
> The DistributedFileSystem class explicitly is _not_ meant for public
> consumption, it is an internal one. Additionally, that method has been
> deprecated.
> 
> What you need is FileSystem#getStatus() if you want the summarized
> report via code.
> 
> A job that possibly runs "du" or "df" is a good idea if you
> guarantee perfect homogeneity of path names in your cluster.
> 
> But I wonder, why won't using a general monitoring tool (such as
> nagios) for this purpose cut it? What's the end goal here?
> 
> P.s. I'd moved this conversation to hdfs-user@ earlier on, but now I
> see it being cross posted into mr-user, common-user, and common-dev --
> Why?
> 
> On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686
>  wrote:
>> We can write a simple program and you can call this API.
>> 
>> Make sure the Hadoop jars are present in your classpath.
>> Just for more clarification, DNs will send their stats as part of
>> heartbeats, so the NN maintains all the statistics about the disk space
>> usage for the complete filesystem, etc. This API will give you those
>> stats.
>> 
>> Regards,
>> Uma
>> 
>> - Original Message -
>> From: ivan.nov...@emc.com
>> Date: Monday, October 17, 2011 9:07 pm
>> Subject: Re: Is there a good way to see how full hdfs is
>> To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
>> Cc: common-...@hadoop.apache.org
>> 
>>> So is there a client program to call this?
>>> 
>>> Can one write their own simple client to call this method from all
>>> disks on the cluster?
>>> 
>>> How about a map reduce job to collect from all disks on the cluster?
>>> 
>>> On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686"
>>> wrote:
>>> 
 /** Return the disk usage of the filesystem, including total capacity,
   * used space, and remaining space */
 public DiskStatus getDiskStatus() throws IOException {
   return dfs.getDiskStatus();
 }
 
 DistributedFileSystem has the above API from java API side.
 
 Regards,
 Uma
 
 - Original Message -
 From: wd 
 Date: Saturday, October 15, 2011 4:16 pm
 Subject: Re: Is there a good way to see how full hdfs is
 To: mapreduce-u...@hadoop.apache.org
 
> hadoop dfsadmin -report
> 
> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
>  wrote:
>> We have a small cluster with HDFS running on only 8 nodes - I
> believe that
>> the partition assigned to hdfs might be getting full and
>> wonder if the web tools or Java API have a way to look at free
> space on
>> hdfs
>> 
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>
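PS: to script against the mbeans Rajiv lists above without a JMX client, the
/jmx servlet he points at can be read with plain Java (the host and port below
are placeholders for your own NameNode web UI):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Fetch the NameNodeInfo mbean as JSON over HTTP from the NameNode web UI.
public class JmxDump {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://namenode.example.com:50070/jmx"
        + "?qry=Hadoop:service=NameNode,name=NameNodeInfo");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}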

Re: execute hadoop job from remote web application

2011-10-20 Thread Steve Loughran

On 18/10/11 17:56, Harsh J wrote:

Oleg,

It will pack up the jar that contains the class specified by
"setJarByClass" into its submission jar and send it up. Thats the
function of that particular API method. So, your deduction is almost
right there :)

On Tue, Oct 18, 2011 at 10:20 PM, Oleg Ruchovets  wrote:

So you mean that, in case I am going to submit the job remotely and
my_hadoop_job.jar is in the classpath of my web application, it will
submit the job with my_hadoop_job.jar to the remote Hadoop machine
(cluster)?




There's also the problem of waiting for your work to finish. If you want 
to see something complicated that does everything but JAR upload, I have 
some code here that listens for events coming out of the job and so 
builds up a history of what is happening. It also does better preflight 
checking of source and dest data directories.


http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/mapreduce/submitter/SubmitterImpl.java
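For contrast, the simple end of the spectrum is just a driver like the sketch
below. Host names, ports and paths are placeholders, and it leans on the
identity map/reduce defaults purely to stay self-contained; the points to note
are that setJarByClass() decides which jar gets shipped, and that
waitForCompletion() is the simple answer to the waiting problem I mentioned:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Submit a job to a remote cluster from a plain Java client. The jar that
// contains this class is what setJarByClass() packs up and sends with the job.
public class RemoteSubmit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");  // placeholder
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");    // placeholder
    Job job = new Job(conf, "remote-submission-example");
    job.setJarByClass(RemoteSubmit.class);
    FileInputFormat.addInputPath(job, new Path("/user/oleg/input"));   // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path("/user/oleg/output"));
    // Blocks until the job finishes, printing progress as it goes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}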


Re: mapr "common" library?

2011-10-20 Thread Alex Gauthier
Thanks guys, and sorry for not being more specific, but yes, Cloud9 and Mahout
are definitely what I'm looking for; much appreciated.

On Wed, Oct 19, 2011 at 9:23 PM, Harsh J  wrote:

> Alex,
>
> I know of Cloud9 http://lintool.github.com/Cloud9/index.html as a
> library that caters to Hadoop MapReduce specifically, but am sure
> there are others. I'm just not sure of any very prominent ones.
>
> Apache Mahout carries MR code in it as well, and you might be
> interested in checking it out at http://mahout.apache.org.
>
> On Thu, Oct 20, 2011 at 8:36 AM, Alex Gauthier 
> wrote:
> > Is there such a thing somewhere? I have the basic nPath, Lucene-like
> > search processing but am looking for ETL-like transformations, a typical
> > weblog or clickstream processor. Anything beyond "wordcount" would be
> > appreciated :)
> >
> > GodSpeed.
> >
> > Alex  http://twitter.com/#!/A23Corp
> >
>
>
>
> --
> Harsh J
>