Re: mapr common library?

2011-10-20 Thread Alex Gauthier
Thanks guys, and sorry for not being more specific, but yes, Cloud9 and Mahout
are definitely what I'm looking for; much appreciated.

On Wed, Oct 19, 2011 at 9:23 PM, Harsh J ha...@cloudera.com wrote:

 Alex,

 I know of Cloud9 http://lintool.github.com/Cloud9/index.html as a
 library that caters to Hadoop MapReduce specifically, but I am sure
 there are others. I'm just not aware of any very prominent ones.

 Apache Mahout carries MR code in it as well, and you might be
 interested in checking it out at http://mahout.apache.org.

 On Thu, Oct 20, 2011 at 8:36 AM, Alex Gauthier alexgauthie...@gmail.com
 wrote:
  Is there such a thing somewhere? I have the basic nPath, lucene-like
  search processing, but I am looking for ETL-like transformations, a
  typical weblog processor, or clickstream processing. Anything beyond
  wordcount would be appreciated :)
 
  GodSpeed.
 
  Alex  http://twitter.com/#!/A23Corp
 



 --
 Harsh J



Re: execute hadoop job from remote web application

2011-10-20 Thread Steve Loughran

On 18/10/11 17:56, Harsh J wrote:

Oleg,

It will pack up the jar that contains the class specified by
setJarByClass into its submission jar and send it up. That's the
function of that particular API method. So, your deduction is almost
right there :)
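
For illustration, a minimal remote-submission sketch that relies on
setJarByClass (the host names, paths, and the mapper are placeholders,
not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmit {
  // Trivial pass-through mapper; the point is only that it lives in this jar
  public static class MyMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the remote cluster (placeholder addresses)
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

    Job job = new Job(conf, "remote-submit-demo");
    job.setJarByClass(MyMapper.class); // ships the jar containing MyMapper
    job.setMapperClass(MyMapper.class);
    job.setNumReduceTasks(0);          // map-only keeps the demo minimal
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/tmp/demo-in"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/demo-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}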

On Tue, Oct 18, 2011 at 10:20 PM, Oleg Ruchovets oruchov...@gmail.com wrote:

So you mean that if I submit the job remotely and my_hadoop_job.jar
is on the classpath of my web application, it will submit the job with
my_hadoop_job.jar to the remote hadoop machine (cluster)?




There's also the problem of waiting for your work to finish. If you want
to see something complicated that does everything but the JAR upload, I have
some code here that listens for events coming out of the job and so
builds up a history of what is happening. It also does better preflight
checking of the source and dest data directories:


http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/mapreduce/submitter/SubmitterImpl.java
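
For comparison, a minimal wait-and-report loop (a sketch using the old
mapred API, not taken from the linked code; the JobConf is assumed to be
set up elsewhere):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class WaitForJob {
  public static void waitAndReport(JobConf jobConf) throws Exception {
    JobClient client = new JobClient(jobConf);
    RunningJob running = client.submitJob(jobConf); // returns immediately
    while (!running.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          running.mapProgress() * 100, running.reduceProgress() * 100);
      Thread.sleep(5000); // poll every five seconds
    }
    System.out.println(running.isSuccessful() ? "succeeded" : "failed");
  }
}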


Re: Is there a good way to see how full hdfs is

2011-10-20 Thread Mapred Learn
Hi,
I have the same question regarding the documentation, and also:
is there something like this for memory and CPU utilization?

Sent from my iPhone

Thanks,
JJ

On Oct 19, 2011, at 5:00 PM, Rajiv Chittajallu raj...@yahoo-inc.com wrote:

 ivan.nov...@emc.com wrote on 10/18/11 at 09:23:50 -0700:
 Cool, is there any documentation on how to use the JMX stuff to get
 monitoring data?
 
 I don't know if there is any specific documentation. These are the
 mbeans you might be interested in
 
 Namenode:
 
 Hadoop:service=NameNode,name=FSNamesystemState
 Hadoop:service=NameNode,name=NameNodeInfo
 Hadoop:service=NameNode,name=jvm
 
 JobTracker:
 
 Hadoop:service=JobTracker,name=JobTrackerInfo
 Hadoop:service=JobTracker,name=QueueMetrics,q=queuename
 Hadoop:service=JobTracker,name=jvm
 
 DataNode:
 Hadoop:name=DataNodeInfo,service=DataNode
 
 TaskTracker:
 Hadoop:service=TaskTracker,name=TaskTrackerInfo
 
 You may also want to monitor shuffle_exceptions_caught in 
 Hadoop:service=TaskTracker,name=ShuffleServerMetrics 
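 
 As a quick sketch of pulling these over HTTP (assumes the /jmx servlet
 available from 0.20.204 onward, and a placeholder NameNode host):
 
 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.net.URL;
 
 public class JmxProbe {
   public static void main(String[] args) throws Exception {
     // Placeholder host; 50070 is the default NameNode web port
     URL url = new URL("http://namenode.example.com:50070/jmx"
         + "?qry=Hadoop:service=NameNode,name=NameNodeInfo");
     BufferedReader in =
         new BufferedReader(new InputStreamReader(url.openStream()));
     String line;
     while ((line = in.readLine()) != null) {
       System.out.println(line); // JSON describing the requested mbean
     }
     in.close();
   }
 }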
 
 
 Cheers,
 Ivan
 
 On 10/17/11 6:04 PM, Rajiv Chittajallu raj...@yahoo-inc.com wrote:
 
 If you are running 0.20.204 or later:
 http://phanpy-nn1.hadoop.apache.org:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo
 
 
 ivan.nov...@emc.com wrote on 10/17/11 at 09:18:20 -0700:
 Hi Harsh,
 
 I need access to the data programmatically for system automation, and
 hence I do not want a monitoring tool but access to the raw data.
 
 I am more than happy to use an exposed function or client program and not
 an internal API.
 
 So I am still a bit confused... What is the simplest way to get at this
 raw disk usage data programmatically? Is there an HDFS equivalent of du
 and df, or are you suggesting to just run those on the linux OS (which is
 perfectly doable)?
 
 Cheers,
 Ivan
 
 
 On 10/17/11 9:05 AM, Harsh J ha...@cloudera.com wrote:
 
 Uma/Ivan,
 
 The DistributedFileSystem class is explicitly _not_ meant for public
 consumption; it is an internal one. Additionally, that method has been
 deprecated.
 
 What you need is FileSystem#getStatus() if you want the summarized
 report via code.
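 
 A minimal sketch of that call (assumes a Hadoop version where
 FileSystem#getStatus() is available, and a client config on the
 classpath):
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.FsStatus;
 
 public class HdfsUsage {
   public static void main(String[] args) throws Exception {
     // Reads core-site.xml etc. from the classpath
     FileSystem fs = FileSystem.get(new Configuration());
     FsStatus status = fs.getStatus();
     System.out.println("capacity  = " + status.getCapacity());
     System.out.println("used      = " + status.getUsed());
     System.out.println("remaining = " + status.getRemaining());
   }
 }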
 
 A job that possibly runs du or df is a good idea if you can
 guarantee perfect homogeneity of path names in your cluster.
 
 But I wonder, why won't using a general monitoring tool (such as
 nagios) for this purpose cut it? What's the end goal here?
 
 P.s. I'd moved this conversation to hdfs-user@ earlier on, but now I
 see it being cross posted into mr-user, common-user, and common-dev --
 Why?
 
 On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686
 mahesw...@huawei.com wrote:
 We can write a simple program and you can call this API.
 
 Make sure the Hadoop jars are present in your classpath.
 Just for more clarification: the DNs send their stats as part of
 heartbeats, so the NN maintains all the statistics about disk space
 usage for the complete filesystem. This API will give you those stats.
 
 Regards,
 Uma
 
 - Original Message -
 From: ivan.nov...@emc.com
 Date: Monday, October 17, 2011 9:07 pm
 Subject: Re: Is there a good way to see how full hdfs is
 To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
 Cc: common-...@hadoop.apache.org
 
 So is there a client program to call this?
 
 Can one write their own simple client to call this method from all
 disks on the cluster?
 
 How about a map reduce job to collect from all disks on the cluster?
 
 On 10/15/11 4:51 AM, Uma Maheswara Rao G 72686
 mahesw...@huawei.com wrote:
 
 /** Return the disk usage of the filesystem, including total capacity,
  * used space, and remaining space */
 public DiskStatus getDiskStatus() throws IOException {
   return dfs.getDiskStatus();
 }
 
 DistributedFileSystem has the above API from java API side.
 
 Regards,
 Uma
 
 - Original Message -
 From: wd w...@wdicc.com
 Date: Saturday, October 15, 2011 4:16 pm
 Subject: Re: Is there a good way to see how full hdfs is
 To: mapreduce-u...@hadoop.apache.org
 
 hadoop dfsadmin -report
 
 On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
 lordjoe2...@gmail.com wrote:
 We have a small cluster with HDFS running on only 8 nodes - I
 believe that the partition assigned to hdfs might be getting full and
 wonder if the web tools or java api have a way to look at free
 space on hdfs
 
 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com
 
 -- 
 Harsh J
 
 
 


Fixing Mis-replicated blocks

2011-10-20 Thread John Meagher
After a hardware move with an unfortunately mis-set-up rack awareness
script, our hadoop cluster has a large number of mis-replicated blocks.
After about a week things haven't gotten better on their own.

Is there a good way to trigger the name node to fix the mis-replicated blocks?

Here's what I'm using for now, but it is very slow:
for f in `hadoop fsck / | grep "Replica placement policy is violated" \
    | head -n3000 | awk -F: '{print $1}'`; do
  hadoop fs -setrep 4 $f
  hadoop fs -setrep 3 $f
done

John


Capacity Scheduler : how to use more than the queue capacity ?

2011-10-20 Thread Sami Dalouche
Hi,

By choosing the capacity scheduler, I was under the impression that each
queue could borrow other queues' resources if they are available.


Let's say we have the configuration below, and a total capacity of 180
slots.
What I expect is that whenever the default and cpu-bound queues have no
jobs, jobs submitted to io-bound should be able to borrow up to 90 slots
(50% of total capacity).
However, it looks like it never gets above 59 slots (33% of 180 slots).

Is there something I missed?
Thanks,
Sami Dalouche

---
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>33</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.supports-priority</name>
  <value>true</value>
</property>

<!-- queue: io-bound -->
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.capacity</name>
  <value>33</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.io-bound.supports-priority</name>
  <value>true</value>
</property>

<!-- queue: cpu-bound -->
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.capacity</name>
  <value>34</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.cpu-bound.supports-priority</name>
  <value>true</value>
</property>


Re: Capacity Scheduler : how to use more than the queue capacity ?

2011-10-20 Thread Sami Dalouche
Hi,

I ended up finding another post about the exact same issue on this very
mailing list, just a few days old...

It looks like the setting to play with is
mapred.capacity-scheduler.default-user-limit-factor.

Sami
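
For reference, a sketch of that setting in capacity-scheduler.xml (the
value is illustrative; with the default factor of 1, a single user
appears to be capped at the queue's configured capacity, 33% of 180 = 59
slots, regardless of maximum-capacity):

<property>
  <name>mapred.capacity-scheduler.default-user-limit-factor</name>
  <!-- Illustrative value: let one user's jobs grow to 2x the queue
       capacity when spare slots are available -->
  <value>2</value>
</property>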





Re: Hadoop archive

2011-10-20 Thread John George
Could you try 0.20.205.0? The HAR issue in branch-20-security was addressed
by JIRA HADOOP-7539.


-Original Message-
From: Jonas Hartwig jonas.hart...@cision.com
Reply-To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Mon, 17 Oct 2011 02:11:24 -0700
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Subject: Hadoop archive

Hi, I'm new to the community.

I'd like to create an archive but I get the error: Exception in archives
null.

I'm using hadoop 0.20.204.0. The issue was tracked under MAPREDUCE-1399
https://issues.apache.org/jira/browse/MAPREDUCE-1399 and solved. How
do I combine my hadoop version with a new map/reduce release? And how do
I get the release using firefox? I saw something like JIRA but the
firefox plugin is not working with 7.x.
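
For reference, a typical archive invocation of that era looks like this
(illustrative paths, not the poster's actual command; check the Hadoop
Archives guide for your version):

hadoop archive -archiveName files.har -p /user/jonas dir1 dir2 /user/jonas/har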

 

regards




Connecting to vm through java

2011-10-20 Thread JAX
 Hi guys: I'm getting the dreaded

org.apache.hadoop.ipc.Client$Connection handleConnectionFailure 

when connecting to Cloudera's hadoop (running in a vm) to request running a
simple m/r job (from a machine outside the hadoop vm)..

I've seen a lot of posts about this online, and it's also on stack overflow 
here : 
http://stackoverflow.com/questions/6997327/connecting-to-cloudera-vm-from-my-desktop

Any tips on debugging Java's connection to hdfs over the network?

It's not entirely clear to me how the connection is made/authenticated between
the client and hadoop; for example, is a passwordless ssh file required..? I
believe this error is related to authentication but am not sure of the best way
to test it... I have confirmed that the IP is valid, and it appears that hdfs is
being run and served over the right default port in the vm.
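
A minimal sketch for isolating the connection problem (the IP and port
below are placeholders for the VM's NameNode; this only exercises the
RPC path, not a full m/r submission):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConnectTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address; 8020 is a common NameNode RPC port
    FileSystem fs = FileSystem.get(
        new URI("hdfs://192.168.56.101:8020"), conf);
    System.out.println("connected; home = " + fs.getHomeDirectory());
    fs.close();
  }
}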




Sent from my iPad

running sqoop on hadoop cluster

2011-10-20 Thread firantika

Hi All,
I'm a newbie on hadoop.

If I install hadoop on 2 nodes, where does hdfs run? On the master or the
slave node?

And then, if I run sqoop to export from a dbms to hive, is there a
speed-up between hadoop running on a single node and hadoop running on
multiple nodes?

Could someone please explain?
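
For reference, a typical Sqoop command for moving a DBMS table into Hive
looks roughly like this (hypothetical connection string and table; check
your Sqoop version's docs):

sqoop import --connect jdbc:mysql://dbhost/mydb --table orders \
    --hive-import --username dbuser -P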


Tks


-- 
View this message in context: 
http://old.nabble.com/running-sqoop-on-hadoop-cluster-tp32693398p32693398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Fixing Mis-replicated blocks

2011-10-20 Thread Jeff Bean
Do setrep -w on the increase to force the new replica before decreasing
again.

Of course, the little script only works if the replication factor is 3 on
all the files. If it's a variable amount, you should use the java API to
get the existing factor, increase it by one, and then decrease it again.
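
A sketch of that variable-factor approach (the file arguments are
hypothetical, and the wait step is simplified relative to setrep -w):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rereplicate {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (String name : args) {
      Path p = new Path(name);
      FileStatus st = fs.getFileStatus(p);
      short rep = st.getReplication();           // existing factor
      fs.setReplication(p, (short) (rep + 1));   // force a fresh replica
      // In practice, wait for the extra replica (like setrep -w) here
      fs.setReplication(p, rep);                 // restore original factor
    }
  }
}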

Jeff

On Thu, Oct 20, 2011 at 8:44 AM, John Meagher john.meag...@gmail.com wrote:

 After a hardware move with an unfortunately mis-set-up rack awareness
 script, our hadoop cluster has a large number of mis-replicated blocks.
 After about a week things haven't gotten better on their own.

 Is there a good way to trigger the name node to fix the mis-replicated
 blocks?

 Here's what I'm using for now, but it is very slow:
 for f in `hadoop fsck / | grep "Replica placement policy is violated" \
     | head -n3000 | awk -F: '{print $1}'`; do
   hadoop fs -setrep 4 $f
   hadoop fs -setrep 3 $f
 done

 John