Behaviour of FileSystem.rename() and hadoop fs -mv command

2010-02-19 Thread Paul Ingles
Hi,

I'm hoping someone can point out where I've been particularly silly.

The behaviour I've observed when using the fs shell -mv command is that it will
create any parent directories whilst doing the move. This appears to call
rename() on the FileSystem which, when I've tried invoking it directly, doesn't
seem to create the parent directories; it just returns false.

If I want to move a file on HDFS (creating any parent directories), is there an
existing class I can use?
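
For what it's worth, here's a rough sketch of the workaround I'm considering
(class and method names are my own, untested): mkdirs() on the destination's
parent, then rename().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveWithParents {
  // Create any missing parent directories of dst, then rename src to dst.
  public static boolean moveCreatingParents(FileSystem fs, Path src, Path dst)
      throws java.io.IOException {
    Path parent = dst.getParent();
    if (parent != null && !fs.exists(parent)) {
      if (!fs.mkdirs(parent)) {
        return false;                  // could not create the parents
      }
    }
    return fs.rename(src, dst);        // plain rename once the parents exist
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(moveCreatingParents(fs, new Path(args[0]), new Path(args[1])));
  }
}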

Thanks,
Paul

why not zookeeper for the namenode

2010-02-19 Thread Thomas Koch
Hi,

Yesterday I read the documentation of ZooKeeper and the ZK contrib BookKeeper.
From what I read, I thought that BookKeeper would be the ideal enhancement
for the namenode, to make it distributed and therefore finally highly available.
Now I searched whether work in that direction has already started and found out
that apparently a totally different approach has been chosen:
http://issues.apache.org/jira/browse/HADOOP-4539

Since I'm new to Hadoop, I do trust your decision. However, I'd be glad if
somebody could satisfy my curiosity:

- Why hasn't ZooKeeper (with BookKeeper) been chosen? Especially since it
  seems to do a similar job already in HBase.

- Isn't it the case that with HADOOP-4539 clients can only connect to one namenode at
  a time, leaving the burden of all reads and writes on that one node's shoulders?

- Isn't it the case that ZooKeeper would be more network efficient? It requires only a
  majority of nodes to receive a change, while HADOOP-4539 seems to require
  all backup nodes to receive a change before it's persisted.

Thanks for any explanation,

Thomas Koch, http://www.koch.ro


Re: why not zookeeper for the namenode

2010-02-19 Thread Todd Lipcon
On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch tho...@koch.ro wrote:
 Hi,

 Yesterday I read the documentation of ZooKeeper and the ZK contrib BookKeeper.
 From what I read, I thought that BookKeeper would be the ideal enhancement
 for the namenode, to make it distributed and therefore finally highly available.
 Now I searched whether work in that direction has already started and found out
 that apparently a totally different approach has been chosen:
 http://issues.apache.org/jira/browse/HADOOP-4539

 Since I'm new to Hadoop, I do trust your decision. However, I'd be glad if
 somebody could satisfy my curiosity:


I didn't work on that particular design, but I'll do my best to answer
your questions below:

 - Why hasn't ZooKeeper (with BookKeeper) been chosen? Especially since it
  seems to do a similar job already in HBase.


HBase does not currently use BookKeeper. Rather, it just uses ZK for
election and some small amount of metadata tracking. It therefore is
only storing a small amount of data in ZK, whereas the Hadoop NN would
have to store many GB worth of namesystem data. I don't think anyone
has tried putting such a large amount of data in ZK yet, and being the
first to do something is never without problems :)

Additionally, when this design was made, BookKeeper was very new. It's
still in development, as I understand it.

 - Isn't it the case that with HADOOP-4539 clients can only connect to one namenode at
  a time, leaving the burden of all reads and writes on that one node's shoulders?


Yes.

 - Isn't it the case that ZooKeeper would be more network efficient? It requires only a
  majority of nodes to receive a change, while HADOOP-4539 seems to require
  all backup nodes to receive a change before it's persisted.


Potentially. However, "all backup nodes" is usually just one. In our
experience, and the experience of most other Hadoop deployments I've
spoken with, the primary factors decreasing NN availability are *not*
system crashes, but rather lack of online upgrade capability, slow
restart time for planned restarts, etc. Adding a hot standby can help
with the planned upgrade situation, but two standbys don't give you
much reliability above one. In a datacenter, the failure correlations
are generally such that racks either fail independently, or the entire
DC has lost power. So, there aren't a lot of cases where 3 NN replicas
would buy you much over 2.

-Todd

 Thanks for any explanation,

 Thomas Koch, http://www.koch.ro



Data-Intensive Text Processing with MapReduce

2010-02-19 Thread Jimmy Lin

Hi everyone,

I'm pleased to present the first complete draft of a forthcoming book:

Data-Intensive Text Processing with MapReduce
by Jimmy Lin and Chris Dyer

The complete text is available at:
http://www.umiacs.umd.edu/~jimmylin/book.html

It's slated for publication by Morgan & Claypool in mid-2010.

This text is currently being used in the MapReduce course at the 
University of Maryland.  The focus of the book is on algorithm design 
and thinking at scale.  Quite explicitly, the book is *not* about 
Hadoop programming.  Tom White's book already does that quite well... :)


Table of Contents

   1. Introduction
   2. MapReduce Basics
   3. MapReduce algorithm design
   4. Inverted Indexing for Text Retrieval
   5. Graph Algorithms
   6. EM Algorithms for Text Processing
   7. Closing Remarks

We hope you find this resource helpful... Comments and feedback are welcome!

Best,
Jimmy



Re: Developing cross-component patches post-split

2010-02-19 Thread ANKITBHATNAGAR

Hi,

I am working on the 0.18.3 branch and it works fine for me.
Could you tell me how you build the code?


- Make sure to check out the project and create a new Java project.
- Also make sure not to use the default bin directory, as there will be an
issue because Hadoop writes some class files and Eclipse overrides them.



Ankit

-- 
View this message in context: 
http://old.nabble.com/Developing-cross-component-patches-post-split-tp27634796p27656854.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



JVM heap and sort buffer size guidelines

2010-02-19 Thread Vasilis Liaskovitis
Hi,

For a node with M gigabytes of memory and N total child tasks (both
map + reduce) running on the node, what do people typically use for
the following parameters:

- Xmx (heap size per child task JVM)?
I.e., my question here is what percentage of the node's total memory you
use for the heaps of the tasks' JVMs. I am trying to reuse JVMs,
and there are roughly N task JVMs on one node at any time. I've tried
using a very large chunk of my node's memory for heaps (i.e. close
to M/N) and I have seen better execution times without experiencing
swapping, but I am wondering if this is job-specific behaviour. When
I've set both -Xmx and -Xms to the same heap size (i.e. maximum
and minimum heap size the same, to avoid contraction and expansion
overheads) I have run into some swapping; I guess Xms=Xmx should be
avoided if we are close to the physical memory limit.

- io.sort.mb and io.sort.factor. I understand that to answer this we'd
have to take the disk configuration into consideration. Do you
consider this only a function of the disks or also a function of the heap
size? Obviously io.sort.mb < heap size, but how much space do you leave
for non-sort buffer usage?

I am interested in small cluster setups (8-16 nodes), not large
clusters, if that makes any difference.
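
To make the question concrete, this is roughly what I'm experimenting with
per job right now (a sketch; the numbers are just my current guesses for a
node with ~8 GB of RAM and ~8 child tasks, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    conf.set("mapred.child.java.opts", "-Xmx768m"); // heap per child task JVM
    conf.setInt("io.sort.mb", 256);                 // map-side sort buffer; must fit inside the heap
    conf.setInt("io.sort.factor", 64);              // number of streams merged at once
    conf.setNumTasksToExecutePerJvm(-1);            // reuse JVMs within the job
    return conf;
  }
}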

- Vasilis


Re: JNI in Map Reduce

2010-02-19 Thread Allen Wittenauer

See http://issues.apache.org/jira/browse/HADOOP-2867 (and
https://issues.apache.org/jira/browse/HADOOP-5980 if you are using 0.21 or
Y! Hadoop w/LinuxTaskController). What version of Hadoop are you using?

Also, if this is custom code, what does the runtime link path look like ( -R
during compile time)?  Using $ORIGIN might be useful here.


On 2/19/10 8:47 AM, Utkarsh Agarwal unrealutka...@gmail.com wrote:

 How do I set LD_LIBRARY_PATH for the child? Configuring mapred-site.xml
 doesn't work. Also, setting -Djava.library.path is not good enough since it
 only gets a reference to the lib I am trying to load (let's say lib.so),
 but that lib has dependencies on other libs like lib1.so, resulting in
 UnsatisfiedLinkError. Thus, LD_LIBRARY_PATH has to be set.
 
 
 
 On Thu, Feb 18, 2010 at 10:03 PM, Jason Venner jason.had...@gmail.comwrote:
 
 We used to do this all the time at Attributor. Now if I can remember how we
 did it...
 
 If the libraries are constant you can just install them on your nodes to
 save pushing them through the distributed cache, and then set up
 LD_LIBRARY_PATH correctly.
 
 The key issue, if you push them through the distributed cache, is ensuring
 that the directory the library gets dropped in is actually in the
 runtime java.library.path.
 You can also give explicit paths to System.load.
 
 The -Djava.library.path in the child options (mapred.child.java.opts, if I
 have the param correct) should work also.
 
 On Thu, Feb 18, 2010 at 6:49 PM, Utkarsh Agarwal unrealutka...@gmail.com
 wrote:
 
 My .so file has other .so dependencies, so would I have to add them all
 in the DistributedCache? Also, I tried setting LD_LIBRARY_PATH in
 mapred-site.xml as:
 
 <property>
   <name>mapred.child.env</name>
   <value>LD_LIBRARY_PATH=/opt/libs/</value>
 </property>
 
 
 but it doesn't work. Setting java.library.path alone is not sufficient; I have to
 get LD_LIBRARY_PATH set.
 
 -Utkarsh
 
 On Thu, Feb 18, 2010 at 3:14 PM, Allen Wittenauer
 awittena...@linkedin.comwrote:
 
 
 
 Like this:
 
 
 
 
  http://hadoop.apache.org/common/docs/current/native_libraries.html#Loading+native+libraries+through+DistributedCache
 
 
 
 On 2/16/10 5:29 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:
 
 How would this work?
 
 On Fri, Feb 12, 2010 at 10:45 AM, Allen Wittenauer
 awittena...@linkedin.com wrote:
 
 ... or just use distributed cache.
 
 
 On 2/12/10 10:02 AM, Alex Kozlov ale...@cloudera.com wrote:
 
 All native libraries should be on each of the cluster nodes.  You
 need
 to
 set java.library.path property to point to your libraries (or
 just
 put
 them in the default system dirs).
 
 On Fri, Feb 12, 2010 at 9:12 AM, Utkarsh Agarwal
 unrealutka...@gmail.comwrote:
 
 Can anybody point me how to use JNI calls in a map reduce program.
 My
 .so
 files have other dependencies also , is there a way to load the
 LD_LIBRARY_PATH for child processes . Should all the native stuff
 be
 in
 HDFS?
 
 Thanks,
 Utkarsh.
 
 
 
 
 
 
 
 
 
 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals
 



Re: JNI in Map Reduce

2010-02-19 Thread Utkarsh Agarwal
I am using Hadoop 0.20.1. I added the attached patch, but child processes
still don't get the path :(
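
For context, this is roughly how the code tries to load the native library
(the library names and paths here are placeholders, not the real ones):

public class NativeLoader {
  static {
    // Option 1: load the dependency explicitly by absolute path, then the
    // main library -- System.load() takes full paths, so java.library.path
    // is not consulted for these two calls.
    System.load("/opt/libs/lib1.so");
    System.load("/opt/libs/lib.so");

    // Option 2 (what I'd prefer): System.loadLibrary("mylib"), which resolves
    // libmylib.so via java.library.path, but its dependencies still need
    // LD_LIBRARY_PATH set in the child JVM's environment.
    // System.loadLibrary("mylib");
  }
}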


Index: src/mapred/mapred-default.xml
===
--- src/mapred/mapred-default.xml	(revision 781689)
+++ src/mapred/mapred-default.xml	(working copy)
@@ -406,6 +406,16 @@
 </property>
 
 <property>
+  <name>mapred.child.env</name>
+  <value></value>
+  <description>User added environment variables for the task tracker child 
+  processes. Example :
+  1) A=foo  This will set the env variable A to foo
+  2) B=$B:c This is inherit tasktracker's B env variable.  
+  </description>
+</property>
+
+<property>
   <name>mapred.child.ulimit</name>
   <value></value>
   <description>The maximum virtual memory, in KB, of a process launched by the 
Index: src/mapred/org/apache/hadoop/mapred/TaskRunner.java
===
--- src/mapred/org/apache/hadoop/mapred/TaskRunner.java	(revision 781689)
+++ src/mapred/org/apache/hadoop/mapred/TaskRunner.java	(working copy)
@@ -399,6 +399,25 @@
         ldLibraryPath.append(oldLdLibraryPath);
       }
       env.put("LD_LIBRARY_PATH", ldLibraryPath.toString());
+      
+      // add the env variables passed by the user
+      String mapredChildEnv = conf.get("mapred.child.env");
+      if (mapredChildEnv != null && mapredChildEnv.length() > 0) {
+        String childEnvs[] = mapredChildEnv.split(",");
+        for (String cEnv : childEnvs) {
+          String[] parts = cEnv.split("="); // split on '='
+          String value = env.get(parts[0]);
+          if (value != null) {
+            // replace $env with the tt's value of env
+            value 
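
With that patch applied, I set the variable from the job configuration roughly
like this (the library path is a placeholder for wherever the .so files live
on the nodes):

import org.apache.hadoop.mapred.JobConf;

public class ChildEnvExample {
  public static JobConf withChildEnv() {
    JobConf conf = new JobConf();
    // comma-separated NAME=VALUE pairs, as described by the patched
    // mapred.child.env property
    conf.set("mapred.child.env", "LD_LIBRARY_PATH=/opt/libs");
    return conf;
  }
}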

HDFS vs Giant Direct Attached Arrays

2010-02-19 Thread Edward Capriolo
Hadoop is great. Almost every day I live gives me more reasons to like it.

My story for today:
We have a system running a file system with a 48 TB Disk array on 4
shelves. Today I got this information about firmware updates. (don't
you love firmware updates?)

---
Any  Controllers configured with any of the  SATA hard drives
listed in the Scope section might exhibit slow virtual disk
initialization/expansion, rare drive stalls and timeouts, scrubbing
errors, and reduced performance.

Data might be at risk if multiple drive failures occur. Proactive hard
drive replacement is neither necessary, nor authorized.

Updating the firmware of disk drives in a virtual disk risks the loss
of data and causes the drives to be temporarily inaccessible.
---
In a nutshell, the safest way is to take the system offline and update the
disks one at a time (we don't know how long updating one disk takes).
Or we have to smart-fail disks and move them out of this array into
another array (lucky we have another one), apply the firmware, put the
disk back in, and wait for the re-stripe. Repeat 47 times!

So the options are:
1) Risky -- do the update online and hope we do not corrupt the thing
2) Slow -- take the system offline and update one disk at a time, as suggested

No option has zero downtime. Also note that this update fixes "reduced
performance", so this chassis was never operating at max
efficiency, due to whatever reason: RAID card complexity, firmware,
backplane, whatever.

Now, imagine if this were a 6-node Hadoop system with 8 disks per node,
and we had to do a firmware update. Wow, this would be easy. We could
accomplish this with no system-wide outage, at our leisure. With a
file replication factor of 3 we could hot-swap disks, or even safely
fail an entire node, with no outage.

We would not need spare hardware, would not need to inform people of an outage,
or disable alerts. Hadoop would not care if the firmware on all the
disks did not match. Hadoop does not have some complicated RAID that
was running at "reduced performance" all this time. Hadoop just uses
independent disks; much less complexity.

HDFS ForTheWin!


Some information on Hadoop Sort

2010-02-19 Thread aa225
Hello,
  I was wondering if someone could give me some information on how Hadoop does the
sorting. From what I have read there does not seem to be a map class and a reduce
class? Where and how is the sorting parallelized?


Best Regards from Buffalo

Abhishek Agrawal

SUNY- Buffalo
(716-435-7122)





Re: Some information on Hadoop Sort

2010-02-19 Thread Gang Luo
Hi,
the sorting is done by the MapReduce framework. At the map side, the output records 
first go to a sorting buffer where the sorting, partitioning and combining 
(if there is a combiner) happen. If necessary, multi-phase sorting is done to 
produce a single sorted result for each map task. At the reduce side, all the data 
from multiple map tasks is merged (each piece is already sorted at the map side, 
so you only need a merge sort here). This goes through multiple rounds if necessary.
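
To see this concretely, here is a minimal sketch (old 0.20 mapred API, untested):
even with identity map and reduce classes, each reducer's output comes out sorted
by key, because the sort/merge happens inside the framework between the two phases.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class FrameworkSort {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FrameworkSort.class);
    conf.setJobName("framework-sort");
    // no user sort code: map outputs are sorted and partitioned by the
    // framework, and each reducer merge-sorts the map outputs it pulls
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}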

-Gang



----- Original Message -----
From: aa...@buffalo.edu aa...@buffalo.edu
To: common-user@hadoop.apache.org
Sent: 2010/2/19 (Fri) 2:25:50 PM
Subject: Some information on Hadoop Sort

Hello,
  I was wondering if someone could give me some information on how Hadoop does the
sorting. From what I have read there does not seem to be a map class and a reduce
class? Where and how is the sorting parallelized?


Best Regards from Buffalo

Abhishek Agrawal

SUNY- Buffalo
(716-435-7122)




On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir

2010-02-19 Thread Saptarshi Guha
Hello,
Not sure if I should post this here or on Cloudera's message board,
but here goes.
When I run EC2 using the latest CDH2 and Hadoop 0.20 (by setting the
env variables via hadoop-ec2) and launch a job with

hadoop jar ...

I get the following error:


10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid
local directories in property: mapred.local.dir
at 
org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975)
at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279)
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256)
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)

at org.godhuli.f.RHMR.submitAndMonitorJob(RHMR.java:195)

but the value  of mapred.local.dir is /mnt/hadoop/mapred/local

Any ideas?


hadoop-streaming tutorial with -archives option

2010-02-19 Thread Michael Kintzer
Hi,

Hadoop/HDFS newbie.  Been struggling with getting the streaming example working 
with -archives.   c.f.  
http://hadoop.apache.org/common/docs/r0.20.1/streaming.html#Large+files+and+archives+in+Hadoop+Streaming

My environment is the Pseudo-distributed environment setup per: 
http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed

I've run into a couple of issues. The first issue is a FileNotFoundException when 
the #symlink suffix is specified with the -archives or -files options as per 
the tutorial.

hadoop jar $HADOOP_HOME/hadoop-0.20.1-streaming.jar -archives 
hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink -input 
samples/cachefile/input.txt -mapper "xargs cat" -reducer "cat" -output 
samples/cachefile/out
java.io.FileNotFoundException: File 
hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink does not 
exist.
at 
org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:349)
at 
org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:275)
at 
org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
at 
org.apache.hadoop.util.GenericOptionsParser.init(GenericOptionsParser.java:153)
at 
org.apache.hadoop.util.GenericOptionsParser.init(GenericOptionsParser.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at 
org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

If I remove the #testlink from the archives definition, the error goes away, 
but then the symlink is not created as the tutorial documentation says it should be.

I've seen this JIRA issue, http://issues.apache.org/jira/browse/HADOOP-6178; it 
shows no Fix Version, but it links to other issues which are supposedly fixed 
in 0.20.1, which is the version I have.

The second issue is "Unrecognized option: -archives" when -archives is specified at the 
end of the arg list.

hadoop jar $HADOOP_HOME/hadoop/hadoop-0.20.1-streaming.jar -input 
samples/cachefile/input.txt -mapper "xargs cat" -reducer "cat" -output 
samples/cachefile/out9 -archives 
hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink
10/02/19 14:29:11 ERROR streaming.StreamJob: Unrecognized option: -archives

Any help getting past this is appreciated. Am I missing a configuration setting 
that allows symlinking? Really hoping to use the archives feature.
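
For what it's worth, the Java equivalent of what I'm trying to get streaming to
do appears to be something like the sketch below, based on the DistributedCache
docs (the HDFS path is just my test path, and I haven't verified this):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class ArchiveSymlinkSketch {
  public static void configure(Configuration conf) throws Exception {
    // ask the framework to create symlinks in the task's working directory
    DistributedCache.createSymlink(conf);
    // the URI fragment after '#' becomes the symlink name (here: testlink)
    DistributedCache.addCacheArchive(
        new URI("hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink"),
        conf);
  }
}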

-Michael
 

Fixed Re: On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir

2010-02-19 Thread Saptarshi Guha
As a follow-up, this is not peculiar to my programs; it happens with the Hadoop
example jobs too. I started the cluster (using the Fedora AMI and c1.medium) as:

hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 test2 1

and error is [1]

But it works (and so do my programs) if I launch a cluster with e.g. 4
nodes as opposed to 1.

Regards
Saptarshi

[1]

[r...@ip-10-250-195-225 ~]# hadoop jar
/usr/lib/hadoop-0.20/hadoop-0.20.1+169.56-examples.jar  wordcount
/tmp/foo /tmp/f1


10/02/19 17:31:45 WARN conf.Configuration: DEPRECATED: hadoop-site.xml
found in the classpath. Usage of hadoop-site.xml is deprecated.
Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to
override properties of core-default.xml, mapred-default.xml and
hdfs-default.xml respectively
10/02/19 17:31:45 INFO input.FileInputFormat: Total input paths to process : 1
org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid
local directories in property: mapred.local.dir
at 
org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975)
at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279)
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256)
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



On Fri, Feb 19, 2010 at 5:13 PM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
 Hello,
 Not sure if i should post this here or on Cloudera's message board,
 but here goes.
 When I run EC2 using the latest CDH2 and Hadoop 0.20 (by setting the
 env variables via hadoop-ec2),
 and launch a job
 hadoop jar ...

 I get the following error


 10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the
 same.
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid
 local directories in property: mapred.local.dir

reuse JVMs across multiple jobs

2010-02-19 Thread Vasilis Liaskovitis
Hi,

Is it possible (and does it make sense) to reuse JVMs across jobs?

The job.reuse.jvm.num.tasks config option is a job-specific parameter,
as its name implies. When running multiple independent jobs
simultaneously with job.reuse.jvm=-1 (this means always reuse), I see
a lot of different Java PIDs (far more than map.tasks.maximum +
reduce.tasks.maximum) across the duration of the job runs, instead of
the same Java processes persisting. The number of live JVMs on a given
node/tasktracker at any time never exceeds map.tasks.maximum +
reduce.tasks.maximum, as expected, but we do tear down idle JVMs and
spawn new ones quite often.
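
For reference, I'm turning reuse on per job roughly like this (a sketch of my
setup, nothing more):

import org.apache.hadoop.mapred.JobConf;

public class ReuseSketch {
  public static JobConf withJvmReuse() {
    JobConf conf = new JobConf();
    // -1 means "always reuse" within this job; since the setting is per job,
    // idle JVMs still get torn down once the job that owns them is done
    conf.setNumTasksToExecutePerJvm(-1);
    return conf;
  }
}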

For example, here are the numbers of distinct Java PIDs when submitting
1, 2, 4, and 32 copies of the same job in parallel:

copies  distinct PIDs
 1       28
 2       39
 4      106
32      740

The relevant killing and spawning logic should be in
src/mapred/org/apache/hadoop/mapred/JvmManager.java,
particularly the reapJvm() method, but I haven't dug deeper. I am
wondering if it would be possible and worthwhile, from a performance
standpoint, to reuse JVMs across jobs, i.e. have a common JVM
pool for all Hadoop jobs.
thanks,

- Vasilis


Re: Many child processes dont exit

2010-02-19 Thread Zheng Lv
Hello Edson,
  Thank you for your reply. I don't want to kill them, I want to know why
these child processes don't exit, and to know how to make them exit
successfully when they finish. Any ideas? Thank you.
LvZheng

2010/2/18 Edson Ramiro erlfi...@gmail.com

 Do you want to kill them ?

 if yes, you can use

  ./bin/slaves.sh pkill java

 but it will kill the datanode and tasktracker processes
 in all slaves and you'll need to start these processes again.

 Edson Ramiro


 On 14 February 2010 22:09, Zheng Lv lvzheng19800...@gmail.com wrote:

  any idea?
 
  2010/2/11 Zheng Lv lvzheng19800...@gmail.com
 
    Hello Everyone,
  We often find many child processes on the datanodes which have already
    finished a long time ago. The following is the jstack log:
 Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed
  mode):
   DestroyJavaVM prio=10 tid=0x2aaac8019800 nid=0x2422 waiting on
   condition [0x]
  java.lang.Thread.State: RUNNABLE
   NioProcessor-31 prio=10 tid=0x439fa000 nid=0x2826 runnable
   [0x4100a000]
  java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at
 sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
   at
  sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   at
 sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   - locked 0x2aaab9b5f6f8 (a sun.nio.ch.Util$1)
   - locked 0x2aaab9b5f710 (a
   java.util.Collections$UnmodifiableSet)
   - locked 0x2aaab9b5f680 (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   at
  
 
 org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:65)
   at
  
 
 org.apache.mina.common.AbstractPollingIoProcessor$Worker.run(AbstractPollingIoProcessor.java:672)
   at
  
 
 org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
   at
  
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
  
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:619)
   pool-15-thread-1 prio=10 tid=0x2aaac802d000 nid=0x2825 waiting on
   condition [0x41604000]
  java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  0x2aaab9b61620 (a
   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at
   java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
   at
  
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
   at
  
 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
   at
  
 
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
   at
  
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
   at java.lang.Thread.run(Thread.java:619)
   Attach Listener daemon prio=10 tid=0x43a22800 nid=0x2608
  waiting
   on condition [0x]
  java.lang.Thread.State: RUNNABLE
   pool-3-thread-1 prio=10 tid=0x438ad000 nid=0x243f waiting on
   condition [0x42575000]
  java.lang.Thread.State: TIMED_WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  0x2aaab9ae8770 (a
   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at
   java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
   at
  
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
   at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
   at
  
 
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:583)
   at
  
 
 java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:576)
   at
  
 
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
   at
  
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
   at java.lang.Thread.run(Thread.java:619)
   Thread for syncLogs daemon prio=10 tid=0x2aaac4446800 nid=0x2437
   waiting on condition [0x40f09000]
  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.mapred.Child$2.run(Child.java:87)
   Low Memory Detector daemon prio=10 tid=0x43771800 nid=0x242d
   runnable [0x]
  java.lang.Thread.State: RUNNABLE
   CompilerThread1 daemon prio=10 tid=0x4376e800 nid=0x242c
  waiting
   
