Restricting quota for users in HDFS

2009-06-16 Thread Palleti, Pallavi
Hi all,

 

We have the chown command in hadoop dfs to make a particular directory owned
by a person. Do we have something similar to create a user with a space
limit, or to restrict the disk usage of a particular user?

 

Thanks

Pallavi 



Re: Datanodes fail to start

2009-06-16 Thread Ian jonhson
If you rebuilt Hadoop, following the HowToRelease wiki page may
reduce the trouble you ran into.


On Sat, May 16, 2009 at 7:20 AM, Pankil Doshi forpan...@gmail.com wrote:
 I got the solution..

 Namespace IDs were somehow incompatible, so I had to clean the data dir and
 temp dir, format the cluster, and make a fresh start.

 Pankil

 On Fri, May 15, 2009 at 2:25 AM, jason hadoop jason.had...@gmail.com wrote:

 There should be a few more lines at the end.
 We only want the part from the last STARTUP_MSG to the end.

 On one of mine a successful start looks like this:
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = at/192.168.1.119
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.19.1-dev
 STARTUP_MSG:   build =  -r ; compiled by 'jason' on Tue Mar 17 04:03:57 PDT
 2009
 /
 2009-03-17 03:08:11,884 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Registered
 FSDatasetStatusMBean
 2009-03-17 03:08:11,886 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at
 50010
 2009-03-17 03:08:11,889 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is
 1048576 bytes/s
 2009-03-17 03:08:12,142 INFO org.mortbay.http.HttpServer: Version
 Jetty/5.1.4
 2009-03-17 03:08:12,155 INFO org.mortbay.util.Credential: Checking Resource
 aliases
 2009-03-17 03:08:12,518 INFO org.mortbay.util.Container: Started
 org.mortbay.jetty.servlet.webapplicationhand...@1e184cb
 2009-03-17 03:08:12,578 INFO org.mortbay.util.Container: Started
 WebApplicationContext[/static,/static]
 2009-03-17 03:08:12,721 INFO org.mortbay.util.Container: Started
 org.mortbay.jetty.servlet.webapplicationhand...@1d9e282
 2009-03-17 03:08:12,722 INFO org.mortbay.util.Container: Started
 WebApplicationContext[/logs,/logs]
 2009-03-17 03:08:12,878 INFO org.mortbay.util.Container: Started
 org.mortbay.jetty.servlet.webapplicationhand...@14a75bb
 2009-03-17 03:08:12,884 INFO org.mortbay.util.Container: Started
 WebApplicationContext[/,/]
 2009-03-17 03:08:12,951 INFO org.mortbay.http.SocketListener: Started
 SocketListener on 0.0.0.0:50075
 2009-03-17 03:08:12,951 INFO org.mortbay.util.Container: Started
 org.mortbay.jetty.ser...@1358f03
 2009-03-17 03:08:12,957 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=DataNode, sessionId=null
 2009-03-17 03:08:13,242 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=DataNode, port=50020
 2009-03-17 03:08:13,264 INFO org.apache.hadoop.ipc.Server: IPC Server
 Responder: starting
 2009-03-17 03:08:13,304 INFO org.apache.hadoop.ipc.Server: IPC Server
 listener on 50020: starting
 2009-03-17 03:08:13,343 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 0 on 50020: starting
 2009-03-17 03:08:13,343 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration =
 DatanodeRegistration(192.168.1.119:50010,
 storageID=DS-540597485-192.168.1.119-50010-1237022386925, infoPort=50075,
 ipcPort=50020)
 2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 1 on 50020: starting
 2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 2 on 50020: starting
 2009-03-17 03:08:13,351 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
 192.168.1.119:50010,
 storageID=DS-540597485-192.168.1.119-50010-1237022386925, infoPort=50075,
 ipcPort=50020)In DataNode.run, data =
 FSDataset{dirpath='/tmp/hadoop-0.19.0-jason/dfs/data/current'}
 2009-03-17 03:08:13,352 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL
 of 360msec Initial delay: 0msec
 2009-03-17 03:08:13,391 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 14 blocks
 got processed in 27 msecs
 2009-03-17 03:08:13,392 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block
 scanner.



 On Thu, May 14, 2009 at 9:51 PM, Pankil Doshi forpan...@gmail.com wrote:

  This is log from datanode.
 
 
  2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 12 msecs
  2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 8 msecs
  2009-05-14 02:36:13,975 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 9 msecs
  2009-05-14 03:36:15,189 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 12 msecs
  2009-05-14 04:36:13,384 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 9 msecs
  2009-05-14 05:36:14,592 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 9 msecs
  2009-05-14 06:36:15,806 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 12 msecs
  2009-05-14 07:36:14,008 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
  82 blocks got processed in 12 msecs
  2009-05-14 08:36:15,204 INFO 

Re: Can I share datas for several map tasks?

2009-06-16 Thread Hello World
Thanks for your reply. Could you do me a favor and check something?
I modified mapred-default.xml as follows:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
And execute bin/stop-all.sh; bin/start-all.sh to restart hadoop;

This is my program:

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public static int[] ToBeSharedData = new int[1024 * 1024 * 16];

    protected void setup(Context context
                         ) throws IOException, InterruptedException {
      // Init shared data
      ToBeSharedData[0] = 12345;
      System.out.println("setup shared data[0] = " + ToBeSharedData[0]);
    }

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
      System.out.println("read shared data[0] = " + ToBeSharedData[0]);
    }
  }

First, can you tell me how to make sure JVM reuse is taking effect? I
didn't see anything different from before: I used the top command under Linux
and saw the same number of java processes and the same memory usage.

Second, can you tell me how to make ToBeSharedData be initialized only once
and readable from the other map tasks on the same node? Or is this not a
suitable programming style for map-reduce?

By the way, I'm using hadoop-0.20.0, in pseudo-distributed mode on a
single node.
Thanks in advance.

On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com wrote:


 snowloong wrote:
  Hi,
  I want to share some data structures between the map tasks on the same node
 (not through files). I mean, if one map task has already initialized some data
 structures (e.g. an array or a list), can other map tasks share that memory
 and access it directly? I don't want to reinitialize the data, and I want to
 save some memory. Can Hadoop help me do this?

 You can enable jvm reuse across tasks. See mapred.job.reuse.jvm.num.tasks
 in mapred-default.xml for usage. Then you can cache the data in a static
 variable in your mapper.

 - Sharad
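
For a concrete starting point, here is a minimal sketch of turning reuse on per
job with the 0.20 API (the driver class name is made up; -1 means "no limit",
as in the property description quoted above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReuseDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Run an unlimited number of this job's tasks per child JVM.
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    Job job = new Job(conf, "word count");
    // ... set mapper, input/output formats and paths, then:
    job.waitForCompletion(true);
  }
}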



Rack Configuration::!

2009-06-16 Thread Sugandha Naolekar
Hello!

How do I configure the machines in different racks?

I have in all 10 machines.

Now I want the hierarchy to be as follows:

machine1
machine2
machine3--these are all DN and TT
machine4

machine5 -JT1

machine7
machine8-- JT2

machine10 -- NN and Sec.NN


As of now I have 7 machines running a Hadoop cluster, which follows the
hierarchy below:

machine1
machine2-- these are all DN and TT
machine3
machine4

machine5--JT

machine6-Sec.NN

machine7---NN

Also, if the machines are configured in different racks, what advantage do
we have? Also, could you give me a few problem statements that involve
processing large amounts of data? What have the Yahoo and Amazon guys done?
What kinds of huge data processing have they handled?


-- 
Regards!
Sugandha


Re: 2009 Hadoop Summit West - was wonderful

2009-06-16 Thread zsongbo
Thanks Jason and Chuck.

On Tue, Jun 16, 2009 at 5:55 AM, Chuck Lam chuck@gmail.com wrote:

 She mentioned a number of projects. I think this one is most relevant.

 ASDF: Automated, Online Fingerpointing for Hadoop
 http://www.pdl.cmu.edu/PDL-FTP/stray/CMU-PDL-08-104_abs.html



 On Sun, Jun 14, 2009 at 6:38 PM, jason hadoop jason.had...@gmail.com
 wrote:

  This is the best I have at present: http://www.cs.cmu.edu/~priya/
 
  On Sat, Jun 13, 2009 at 11:05 AM, zsongbo zson...@gmail.com wrote:
 
   Hi Jason,
   Could you please post more information about Priya Narasimhan's toolset
   for automated fault detection in hadoop clusters?
   Such as url or others.
  
   Thanks.
   Schubert
  
   On Thu, Jun 11, 2009 at 11:26 AM, jason hadoop jason.had...@gmail.com
   wrote:
  
 I had a great time schmoozing with people, and enjoyed a couple of the
   talks
   
 I would love to see more from Priya Narasimhan, and hope their toolset for
automated fault detection in hadoop clusters becomes generally
  available.
Zookeeper rocks on!
   
 HBase is starting to look really good; in 0.20 the master node as the
 single point of failure and configuration headache goes away, and Zookeeper
 takes over.
   
 Owen O'Malley gave a solid presentation on the new Hadoop APIs and the
reasons for the changes.
   
It was good to hang with everyone, see you all next year!
   
 I even got to spend a little time chatting with Tom White, and got a signed
 copy of his book. Thanks Tom!
   
   
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
   
  
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 



Hadoop as Cloud Storage

2009-06-16 Thread W
Dear Hadoop Guru's,

After googling, I found some information on using Hadoop as (long-term)
cloud storage.
I have to maintain a lot of data (around 50 TB), much of it
TV commercials (video files).

I know the best solution for long-term file archiving is tape
backup, but I am just curious: can Hadoop
be used as a 'data archiving' platform?

Thanks!

Warm Regards,
Wildan
---
OpenThink Labs
http://openthink-labs.tobethink.com/

Making IT, Business and Education in Harmony

 087884599249

Y! : hawking_123
Linkedln : http://www.linkedin.com/in/wildanmaulana


Re: Rack Configuration::!

2009-06-16 Thread Harish Mallipeddi
On Tue, Jun 16, 2009 at 2:45 PM, Sugandha Naolekar
sugandha@gmail.com wrote:

 Hello!

 How to configure the machines in different racks?


 Also, if the machines are configured in different racks, what advantage do
 we have?


See this thread:
http://www.nabble.com/Hadoop-topology.script.file.name-Form-td17683521.html
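
In short: point topology.script.file.name at a script that, given one or more
host names or IP addresses as arguments, prints one rack path (such as /rack1)
per argument; the NameNode and JobTracker use that mapping to place replicas
and schedule tasks rack-locally. A sketch of the property, assuming a
hypothetical script at /home/hadoop/rack-map.sh (it goes in hadoop-site.xml,
or core-site.xml on 0.20):

<property>
  <name>topology.script.file.name</name>
  <value>/home/hadoop/rack-map.sh</value>
</property>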


 Also, give me few problem statements which handles big amount of
 data(processing). What Yahoo and Amazon guys have done? What kind of huge
 processing of huge data they have handled?


Just google map reduce applications. Read the original map-reduce paper by
Googlers.

-- 
Harish Mallipeddi
http://blog.poundbang.in


Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Roshan James
Why don't we convert the input split information into the same string format that
is displayed in the web UI? Something like this:
hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185. It's a simple format
and we can always parse such a string in C++.

Is there some reason for the current binary format? If there is good reason
for it, I am game to write such a deserialiser class. Is there some
reference for this binary format that I can use to write the deserialiser?

Roshan

On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley omal...@apache.org wrote:

 *Sigh* We need Avro for input splits.

 That is the expected behavior. It would be great if someone wrote a C++
 FileInputSplit class that took a binary string and converted it back to a
 filename, offset, and length.

 -- Owen



Re: Can I share datas for several map tasks?

2009-06-16 Thread jason hadoop
Among the examples for my book is a JVM reuse example with static data
shared between JVMs.

On Tue, Jun 16, 2009 at 1:08 AM, Hello World snowlo...@gmail.com wrote:

 Thanks for your reply. Can you do me a favor to make a check?
 I modified mapred-default.xml as follows:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
 And execute bin/stop-all.sh; bin/start-all.sh to restart hadoop;

 This is my program:

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public static int[] ToBeSharedData = new int[1024 * 1024 * 16];

    protected void setup(Context context
                         ) throws IOException, InterruptedException {
      // Init shared data
      ToBeSharedData[0] = 12345;
      System.out.println("setup shared data[0] = " + ToBeSharedData[0]);
    }

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
      System.out.println("read shared data[0] = " + ToBeSharedData[0]);
    }
  }

 First, can you tell me how to make sure jvm reuse is taking effect, for I
 didn't see anything different from before. I use top command under linux
 and see the same number of java processes and same memory usage.

 Second, can you tell me how to make the ToBeSharedData be inited only
 once
 and can be read from other MapTasks on the same node? Or this is not a
 suitable programming style for map-reduce?

 By the way, I'm using hadoop-0.20.0, in pseudo-distributed mode on a
 single-node.
 thanks in advance

 On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com
 wrote:

 
  snowloong wrote:
   Hi,
   I want to share some data structures for the map tasks on a same
 node(not
  through files), I mean, if one map task has already initialized some data
  structures (e.g. an array or a list), can other map tasks share these
  memorys and directly access them, for I don't want to reinitialize these
  datas and I want to save some memory. Can hadoop help me do this?
 
  You can enable jvm reuse across tasks. See mapred.job.reuse.jvm.num.tasks
  in mapred-default.xml for usage. Then you can cache the data in a static
  variable in your mapper.
 
  - Sharad
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Datanodes fail to start

2009-06-16 Thread jason hadoop
I often find myself editing src/saveVersion.sh to fake out the version
numbers when I build a Hadoop jar for the first time and have to deploy it
on an already running cluster.


On Mon, Jun 15, 2009 at 11:57 PM, Ian jonhson jonhson@gmail.com wrote:

 If you rebuilt the hadoop, following the wikipage of HowToRelease may
 reduce the trouble occurred.


  On Sat, May 16, 2009 at 7:20 AM, Pankil Doshi forpan...@gmail.com wrote:
  I got the solution..
 
  Namespace IDs where some how incompatible.So I had to clean data dir and
  temp dir ,format the cluster and make a fresh start
 
  Pankil
 
  On Fri, May 15, 2009 at 2:25 AM, jason hadoop jason.had...@gmail.com
 wrote:
 
  There should be a few more lines at the end.
  We only want the part from last the STARTUP_MSG to the end
 
  On one of mine a successfull start looks like this:
  STARTUP_MSG: Starting DataNode
  STARTUP_MSG:   host = at/192.168.1.119
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.19.1-dev
  STARTUP_MSG:   build =  -r ; compiled by 'jason' on Tue Mar 17 04:03:57
 PDT
  2009
  /
  2009-03-17 03:08:11,884 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Registered
  FSDatasetStatusMBean
  2009-03-17 03:08:11,886 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at
  50010
  2009-03-17 03:08:11,889 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is
  1048576 bytes/s
  2009-03-17 03:08:12,142 INFO org.mortbay.http.HttpServer: Version
  Jetty/5.1.4
  2009-03-17 03:08:12,155 INFO org.mortbay.util.Credential: Checking
 Resource
  aliases
  2009-03-17 03:08:12,518 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.servlet.webapplicationhand...@1e184cb
  2009-03-17 03:08:12,578 INFO org.mortbay.util.Container: Started
  WebApplicationContext[/static,/static]
  2009-03-17 03:08:12,721 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.servlet.webapplicationhand...@1d9e282
  2009-03-17 03:08:12,722 INFO org.mortbay.util.Container: Started
  WebApplicationContext[/logs,/logs]
  2009-03-17 03:08:12,878 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.servlet.webapplicationhand...@14a75bb
  2009-03-17 03:08:12,884 INFO org.mortbay.util.Container: Started
  WebApplicationContext[/,/]
  2009-03-17 03:08:12,951 INFO org.mortbay.http.SocketListener: Started
  SocketListener on 0.0.0.0:50075
  2009-03-17 03:08:12,951 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.ser...@1358f03
  2009-03-17 03:08:12,957 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
  Initializing JVM Metrics with processName=DataNode, sessionId=null
  2009-03-17 03:08:13,242 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
  Initializing RPC Metrics with hostName=DataNode, port=50020
  2009-03-17 03:08:13,264 INFO org.apache.hadoop.ipc.Server: IPC Server
  Responder: starting
  2009-03-17 03:08:13,304 INFO org.apache.hadoop.ipc.Server: IPC Server
  listener on 50020: starting
  2009-03-17 03:08:13,343 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 0 on 50020: starting
  2009-03-17 03:08:13,343 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration =
  DatanodeRegistration(192.168.1.119:50010,
  storageID=DS-540597485-192.168.1.119-50010-1237022386925,
 infoPort=50075,
  ipcPort=50020)
  2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 1 on 50020: starting
  2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 2 on 50020: starting
  2009-03-17 03:08:13,351 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
  192.168.1.119:50010,
  storageID=DS-540597485-192.168.1.119-50010-1237022386925,
 infoPort=50075,
  ipcPort=50020)In DataNode.run, data =
  FSDataset{dirpath='/tmp/hadoop-0.19.0-jason/dfs/data/current'}
  2009-03-17 03:08:13,352 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: using
 BLOCKREPORT_INTERVAL
  of 360msec Initial delay: 0msec
  2009-03-17 03:08:13,391 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 14
 blocks
  got processed in 27 msecs
  2009-03-17 03:08:13,392 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block
  scanner.
 
 
 
  On Thu, May 14, 2009 at 9:51 PM, Pankil Doshi forpan...@gmail.com
 wrote:
 
   This is log from datanode.
  
  
   2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 12 msecs
   2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 8 msecs
   2009-05-14 02:36:13,975 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 9 msecs
   2009-05-14 03:36:15,189 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 12 msecs
   2009-05-14 04:36:13,384 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 9 

Re: Debugging Map-Reduce programs

2009-06-16 Thread jason hadoop
When you are running in local mode you have 2 basic choices if you want to
interact with a debugger.
You can launch from within eclipse or other IDE, or you can setup a java
debugger transport as part of the mapred.child.java.opts variable, and
attach to the running jvm.
By far the simplest is loading via eclipse.
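
For the second route, a sketch of the kind of option string being described
(the port number is arbitrary, and conf here is the job's JobConf):

  // Have each child task JVM open a JDWP debug socket and wait on port 8000
  // until a remote debugger attaches; use this only on a small test job.
  conf.set("mapred.child.java.opts",
           "-Xmx512m -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000");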

Your other alternative is to tell the framework to retain the job files
via keep.failed.task.files (be careful here: you will fill your disk with old
dead data) and use the IsolationRunner to debug.

Examples in my book :)


On Mon, Jun 15, 2009 at 6:49 PM, bharath vissapragada 
bharathvissapragada1...@gmail.com wrote:

 I am running in local mode. Can you tell me how to set those breakpoints
 or how to access those files so that I can debug the program?

 The program is generating: java.lang.NumberFormatException: For input
 string: ""

 But that particular string is the one which is the input to the map class.
 So I think that it is not reading my input correctly. But when I try to
 print it, it isn't printing to STDOUT.
 I am using the FileInputFormat class:

  FileInputFormat.addInputPath(conf,
      new Path("/home/rip/Desktop/hadoop-0.18.3/input"));
  FileOutputFormat.setOutputPath(conf,
      new Path("/home/rip/Desktop/hadoop-0.18.3/output"));

 input and output are folders for inp and outpt.

 It is generating these warnings also

 09/06/16 12:38:32 WARN fs.FileSystem: "local" is a deprecated filesystem
 name. Use "file:///" instead.

 Thanks in advance


 On Tue, Jun 16, 2009 at 3:50 AM, Aaron Kimball aa...@cloudera.com wrote:

  On Mon, Jun 15, 2009 at 10:01 AM, bharath vissapragada 
  bhara...@students.iiit.ac.in wrote:
 
   Hi all ,
  
   When running hadoop in local mode .. can we use print statements to
  print
   something to the terminal ...
 
 
  Yes. In distributed mode, each task will write its stdout/stderr to files
  which you can access through the web-based interface.
 
 
  
   Also iam not sure whether the program is reading my input files ... If
 i
   keep print statements it isn't displaying any .. can anyone tell me how
  to
   solve this problem.
 
 
  Is it generating exceptions? Are the files present? If you're running in
  local mode, you can use a debugger; set a breakpoint in your map() method
  and see if it gets there. How are you configuring the input files for
 your
  job?
 
 
  
  
   Thanks in adance,
  
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Roshan James
So after squinting at this a bit I feel this is the format:

Length of string: 00 23

String:
68 64 66 73 3A
2F 2F 6E 79 63
2D 71 77 73 2D
30 32 39 2F 69
6E 2D 64 69 72
2F 77 6F 72 64
73 2E 74 78 74

Start Offset: 00 00 00 00 00 00 00 00
Size: 00 00 00 00 00 02 C4 AC

And this should be the split for file hdfs://nyc-qws-029/in-dir/words.txt
from offset 0 to 181420.

That said, is there some reason why this is the format? I don't want the
deserialiser I write to break from one version of Hadoop to the next.

Roshan


On Tue, Jun 16, 2009 at 9:41 AM, Roshan James 
roshan.james.subscript...@gmail.com wrote:

 Why dont we convert input split information into the same string format
 that is displayed in the webUI? Something like this -
 hdfs://nyc-qws-029/in-dir/words86ac4a.txt:0+184185. Its a simple format
 and we can always parse such a string in C++.

 Is there some reason for the current binary format? If there is good reason
 for it, I am game to write such a deserialiser class. Is there some
 reference for this binary format that I can use to write the deserialiser?

 Roshan


 On Mon, Jun 15, 2009 at 5:40 PM, Owen O'Malley omal...@apache.org wrote:

 *Sigh* We need Avro for input splits.

 That is the expected behavior. It would be great if someone wrote a C++
 FileInputSplit class that took a binary string and converted it back to a
 filename, offset, and length.

 -- Owen





Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Owen O'Malley

Sorry, I forget how much isn't clear to people who are just starting.

FileInputFormat creates FileSplits. The serialization is very stable  
and can't be changed without breaking things. The reason that pipes  
can't stringify it is that the string form of input splits is
ambiguous (and since it is user code, we really can't make assumptions  
about it). The format of FileSplit is:


16 bit filename byte length
filename in bytes
64 bit offset
64 bit length

Technically the filename uses a funky UTF-8 encoding, but in practice,
as long as the filename contains only ASCII characters, the bytes are ASCII.
Look at org.apache.hadoop.io.UTF8.writeString for the precise definition.
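
That layout read back in Java, as a sketch of what a C++ reader would mirror
(a hypothetical class; DataInput is big-endian, and treating the name as UTF-8
is fine for the ASCII case described above):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class RawFileSplit {
  public final String filename;
  public final long offset;
  public final long length;

  public RawFileSplit(byte[] raw) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
    int nameLen = in.readUnsignedShort();        // 16 bit filename byte length
    byte[] name = new byte[nameLen];
    in.readFully(name);                          // filename in bytes
    this.filename = new String(name, "UTF-8");
    this.offset = in.readLong();                 // 64 bit offset
    this.length = in.readLong();                 // 64 bit length
  }
}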


-- Owen


Re: Datanodes fail to start

2009-06-16 Thread Ian jonhson
On Tue, Jun 16, 2009 at 9:55 PM, jason hadoop jason.had...@gmail.com wrote:
 I often find myself editing the src/saveVersion.sh to fake out the version
 numbers, when I build a hadoop jar for the first time, and have to deploy it
 on an already running cluster.



That is not a good solution.


Re: Hadoop as Cloud Storage

2009-06-16 Thread Alex Loddengaard
Hey Wildan,

HDFS is successfully storing well over 50TBs on a single cluster.  It's
meant to store data that will be analyzed in a MR job, but it can be used
for archival storage.  You'd probably consider deploying nodes with lots of
disk space vs. lots of RAM and processor power.  You'll want to do a cost
analysis to determine if tape or HDFS is cheaper.

That said, you should know a few things about HDFS:

   - Its read path is optimized for high throughput, and doesn't care as
   much about latency (read: it's got high latency relative to other file
   systems)
   - It's not meant for small files, so ideally your video files will be at
   least ~100MB each
   - It requires that the machines that makeup your cluster be running
   whenever you want to access or store data.  (Note that HDFS survives if a
   small percentage of your nodes go down; it's built with fault tolerance in
   mind)

I hope this clears things up.  Let me know if you have any other questions.

Alex

On Tue, Jun 16, 2009 at 2:44 AM, W wilda...@gmail.com wrote:

 Dear Hadoop Guru's,

 After googling and find some information on using hadoop as cloud
 storage (long term).
 I have a problem to maintain lots of data (around 50 TB) much of them
 are TV Commercial (video files).

 I know, the best solution for long term file archiving is using tape
 backup, but i just curious, is hadoop
 can be used as 'data archiving' platform ?

 Thanks!

 Warm Regards,
 Wildan
 ---
 OpenThink Labs
 http://openthink-labs.tobethink.com/

 Making IT, Business and Education in Harmony

  087884599249

 Y! : hawking_123
 Linkedln : http://www.linkedin.com/in/wildanmaulana



Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread akhil1988

Hi All,

I am running my mapred program in local mode (by setting mapred.job.tracker
to local) so that I can debug my code.
The mapred program is a direct port of my original sequential code. There
is no reduce phase.
Basically, I have just put my program in the map class.

My program takes around 1-2 min. to instantiate the data objects, which are
created in the constructor of the Map class (it loads some data model files,
so it takes some time). After the instantiation part in the constructor of
the Map class, the map function is supposed to process the input split.

The problem is that the data objects do not get instantiated completely;
in between (while it is still in the constructor) the program stops, giving the
exceptions pasted at the bottom.
The program runs fine without mapreduce and does not require more than 2GB of
memory, but in mapreduce, even after doing export HADOOP_HEAPSIZE=2500 (I am
working on machines with 16GB RAM), the program fails. I have also set
HADOOP_OPTS="-server -XX:-UseGCOverheadLimit", as sometimes I was getting GC
Overhead Limit Exceeded exceptions as well.

Somebody, please help me with this problem: I have been trying to debug it for
the last 3 days, but have been unsuccessful. Thanks!

java.lang.OutOfMemoryError: Java heap space
at sun.misc.FloatingDecimal.toJavaFormatString(FloatingDecimal.java:889)
at java.lang.Double.toString(Double.java:179)
at java.text.DigitList.set(DigitList.java:272)
at java.text.DecimalFormat.format(DecimalFormat.java:584)
at java.text.DecimalFormat.format(DecimalFormat.java:507)
at java.text.NumberFormat.format(NumberFormat.java:269)
at 
org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:110)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1147)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

09/06/16 12:34:41 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:79)
... 5 more
Caused by: java.lang.ThreadDeath
at java.lang.Thread.stop(Thread.java:715)
at 
org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
at
org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1224)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

-- 
View this message in context: 
http://www.nabble.com/Nor-%22OOM-Java-Heap-Space%22-neither-%22GC-OverHead-Limit-Exeeceded%22-tp24059508p24059508.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Restricting quota for users in HDFS

2009-06-16 Thread Allen Wittenauer



On 6/15/09 11:16 PM, Palleti, Pallavi pallavi.pall...@corp.aol.com
wrote:
 We have chown command in hadoop dfs to make a particular directory own
 by a person. Do we have something similar to create user with some space
 limit/restrict the disk usage by a particular user?

Quotas are implemented on a per-directory basis, not per-user. There
is no support for "this user can have X space, regardless of where he/she
writes", only "this directory has a limit of X space, regardless of who
writes there".
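
For reference, the per-directory knobs look like this (a sketch, assuming
Hadoop 0.19 or later, where dfsadmin has space quotas; the directory name is
just an example):

  hadoop dfsadmin -setQuota 100000 /user/pallavi     # cap number of names (files + dirs)
  hadoop dfsadmin -setSpaceQuota 1t /user/pallavi    # cap raw disk usage (includes replication)
  hadoop fs -count -q /user/pallavi                  # show quotas and current usage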



Announcing CloudBase-1.3.1 release

2009-06-16 Thread Ru, Yanbo

Hi,

We have released 1.3.1 version of CloudBase on sourceforge-
https://sourceforge.net/projects/cloudbase

CloudBase is a data warehouse system for Terabyte and Petabyte scale analytics.
It is built on top of the Map-Reduce architecture. It allows you to query flat log
files using ANSI SQL.

Please give it a try and send us your feedback.

Thanks,

Yanbo

Release notes - 
 
New Features: 
* CREATE CSV tables - One can create tables on top of data in CSV (Comma 
Separated Values) format and query them using SQL. Current implementation 
doesn't accept CSV records which span multiple lines. Data may not be processed 
correctly if a field contains embedded line-breaks. Please visit 
http://cloudbase.sourceforge.net/index.html#userDoc for detailed specification 
of the CSV format. 
 
Bug fixes: 
* Aggregate function 'AVG' returns the same value as 'SUM' function 
* If a query has multiple aliases, only the last alias works 


Re: Can I share datas for several map tasks?

2009-06-16 Thread Hello World
I can't get your book, so can you give me a few more words describing the
solution? Much appreciated.

-snowloong

On Tue, Jun 16, 2009 at 9:51 PM, jason hadoop jason.had...@gmail.com wrote:

 In the examples for my book is a jvm reuse with static data shared between
 jvm's example

 On Tue, Jun 16, 2009 at 1:08 AM, Hello World snowlo...@gmail.com wrote:

  Thanks for your reply. Can you do me a favor to make a check?
  I modified mapred-default.xml as follows:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
  And execute bin/stop-all.sh; bin/start-all.sh to restart hadoop;
 
  This is my program:
 
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public static int[] ToBeSharedData = new int[1024 * 1024 * 16];

    protected void setup(Context context
                         ) throws IOException, InterruptedException {
      // Init shared data
      ToBeSharedData[0] = 12345;
      System.out.println("setup shared data[0] = " + ToBeSharedData[0]);
    }

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
      System.out.println("read shared data[0] = " + ToBeSharedData[0]);
    }
  }
 
  First, can you tell me how to make sure jvm reuse is taking effect, for
 I
  didn't see anything different from before. I use top command under
 linux
  and see the same number of java processes and same memory usage.
 
  Second, can you tell me how to make the ToBeSharedData be inited only
  once
  and can be read from other MapTasks on the same node? Or this is not a
  suitable programming style for map-reduce?
 
  By the way, I'm using hadoop-0.20.0, in pseudo-distributed mode on a
  single-node.
  thanks in advance
 
  On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com
  wrote:
 
  
   snowloong wrote:
Hi,
I want to share some data structures for the map tasks on a same
  node(not
   through files), I mean, if one map task has already initialized some
 data
   structures (e.g. an array or a list), can other map tasks share these
   memorys and directly access them, for I don't want to reinitialize
 these
   datas and I want to save some memory. Can hadoop help me do this?
  
   You can enable jvm reuse across tasks. See
 mapred.job.reuse.jvm.num.tasks
   in mapred-default.xml for usage. Then you can cache the data in a
 static
   variable in your mapper.
  
   - Sharad
  
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread jason hadoop
Is it possible that your map class is an inner class and not static?

On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote:


 Hi All,

 I am running my mapred program in local mode by setting
 mapred.jobtracker.local to local mode so that I can debug my code.
 The mapred program is a direct porting of my original sequential code.
 There
 is no reduce phase.
 Basically, I have just put my program in the map class.

 My program takes around 1-2 min. in instantiating the data objects which
 are
 present in the constructor of Map class(it loads some data model files,
 therefore it takes some time). After the instantiation part in the
 constrcutor of Map class the map function is supposed to process the input
 split.

 The problem is that the data objects do not get instantiated completely and
 in between(whlie it is still in constructor) the program stops giving the
 exceptions pasted at bottom.
 The program runs fine without mapreduce and does not require more than 2GB
 memory, but in mapreduce even after doing export HADOOP_HEAPSIZE=2500(I am
 working on machines with 16GB RAM), the program fails. I have also set
 HADOOP_OPTS=-server -XX:-UseGCOverheadLimit as sometimes I was getting GC
 Overhead Limit Exceeded exceptions also.

 Somebody, please help me with this problem: I have trying to debug it for
 the last 3 days, but unsuccessful. Thanks!

 java.lang.OutOfMemoryError: Java heap space
at
 sun.misc.FloatingDecimal.toJavaFormatString(FloatingDecimal.java:889)
at java.lang.Double.toString(Double.java:179)
at java.text.DigitList.set(DigitList.java:272)
at java.text.DecimalFormat.format(DecimalFormat.java:584)
at java.text.DecimalFormat.format(DecimalFormat.java:507)
at java.text.NumberFormat.format(NumberFormat.java:269)
at
 org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:110)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1147)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 09/06/16 12:34:41 WARN mapred.LocalJobRunner: job_local_0001
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:79)
... 5 more
 Caused by: java.lang.ThreadDeath
at java.lang.Thread.stop(Thread.java:715)
at
 org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
at
 org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1224)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 --
 View this message in context:
 

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread akhil1988

One more thing, finally it terminates there (after some time) by giving the
final Exception:

java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


akhil1988 wrote:
 
 Thank you Jason for your reply. 
 
 My Map class is an inner class and it is a static class. Here is the
 structure of my code.
 
public class NerTagger {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        private static NETaggerLevel1 tagger1 = new NETaggerLevel1();
        private static NETaggerLevel2 tagger2 = new NETaggerLevel2();

        Map() {
            System.out.println("HI2\n");
            Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
            System.out.println("HI3\n");
            Parameters.forceNewSentenceOnLineBreaks = Boolean.parseBoolean("true");

            System.out.println("loading the tagger");
            tagger1 = (NETaggerLevel1) Classifier.binaryRead(Parameters.pathToModelFile + ".level1");
            System.out.println("HI5\n");
            tagger2 = (NETaggerLevel2) Classifier.binaryRead(Parameters.pathToModelFile + ".level2");
            System.out.println("Done- loading the tagger");
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String inputline = value.toString();

            /* Processing of the input pair is done here */
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NerTagger.class);
        conf.setJobName("NerTagger");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setNumReduceTasks(0);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        conf.set("mapred.job.tracker", "local");
        conf.set("fs.default.name", "file:///");

        DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
        DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
        DistributedCache.createSymlink(conf);

        conf.set("mapred.child.java.opts", "-Xmx4096m");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        System.out.println("HI1\n");

        JobClient.runJob(conf);
    }
 
 Jason, when the program executes, HI1 and HI2 are printed but it never
 reaches HI3. In the statement
 Parameters.readConfigAndLoadExternalData("Config/allLayer1.config"); it is
 able to access the Config/allLayer1.config file (while executing this
 statement, it prints some messages about which data it is loading, etc.),
 but it gets stuck there (while loading some classifier) and never reaches
 HI3.

 This program runs fine when executed normally (without MapReduce).
 
 Thanks, Akhil
 
 
 
 
 jason hadoop wrote:
 
 Is it possible that your map class is an inner class and not static?
 
 On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote:
 

 Hi All,

 I am running my mapred program in local mode by setting
 mapred.jobtracker.local to local mode so that I can debug my code.
 The mapred program is a direct porting of my original sequential code.
 There
 is no reduce phase.
 Basically, I have just put my program in the map class.

 My program takes around 1-2 min. in instantiating the data objects which
 are
 present in the constructor of Map class(it loads some data model files,
 therefore it takes some time). After the instantiation part in the
 constrcutor of Map class the map function is supposed to process the
 input
 split.

 The problem is that the data objects do not get instantiated 

Re: Can I share datas for several map tasks?

2009-06-16 Thread Iman E
Thank you, Jason. I found the example. So, is there a way to share the same JVM 
between different jobs?





From: jason hadoop jason.had...@gmail.com
To: core-user@hadoop.apache.org
Sent: Tuesday, June 16, 2009 7:22:16 PM
Subject: Re: Can I share datas for several map tasks?

In the example code download bundle, in the package
com.apress.hadoopbook.examples.advancedtechniques, is the class
JVMReuseAndStaticInitializers.java,
which demonstrates sharing data between instances using JVM reuse.

I built this to prove to myself that it was possible.
It never got an actual write up in the book itself.
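
The pattern, roughly, looks like this (a made-up mapper, not the book's actual
example): with JVM reuse enabled, a static field survives across the map tasks
that run in the same child JVM, so the expensive initialization is guarded by a
static check and only the first task in each JVM pays for it.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SharedDataMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Lives as long as the child JVM, i.e. across tasks when reuse is enabled.
  private static int[] shared;

  private static synchronized void loadOnce() {
    if (shared == null) {            // only the first task in this JVM does the work
      shared = new int[1024 * 1024 * 16];
      shared[0] = 12345;
    }
  }

  protected void setup(Context context) throws IOException, InterruptedException {
    loadOnce();
  }

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text("shared[0]"), new Text(Integer.toString(shared[0])));
  }
}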

On Tue, Jun 16, 2009 at 6:55 PM, Hello World snowlo...@gmail.com wrote:

 I can't get your book, so can you give me a few more words to describe the
 solution? very appreciate.

 -snowloong

 On Tue, Jun 16, 2009 at 9:51 PM, jason hadoop jason.had...@gmail.com
 wrote:

  In the examples for my book is a jvm reuse with static data shared
 between
  jvm's example
 
  On Tue, Jun 16, 2009 at 1:08 AM, Hello World snowlo...@gmail.com
 wrote:
 
   Thanks for your reply. Can you do me a favor to make a check?
   I modified mapred-default.xml as follows:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
   And execute bin/stop-all.sh; bin/start-all.sh to restart hadoop;
  
   This is my program:
  
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public static int[] ToBeSharedData = new int[1024 * 1024 * 16];

    protected void setup(Context context
                         ) throws IOException, InterruptedException {
      // Init shared data
      ToBeSharedData[0] = 12345;
      System.out.println("setup shared data[0] = " + ToBeSharedData[0]);
    }

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
      System.out.println("read shared data[0] = " + ToBeSharedData[0]);
    }
  }
  
   First, can you tell me how to make sure jvm reuse is taking effect,
 for
  I
   didn't see anything different from before. I use top command under
  linux
   and see the same number of java processes and same memory usage.
  
   Second, can you tell me how to make the ToBeSharedData be inited only
   once
   and can be read from other MapTasks on the same node? Or this is not a
   suitable programming style for map-reduce?
  
   By the way, I'm using hadoop-0.20.0, in pseudo-distributed mode on a
   single-node.
   thanks in advance
  
   On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal 
 shara...@yahoo-inc.com
   wrote:
  
   
snowloong wrote:
 Hi,
 I want to share some data structures for the map tasks on a same
   node(not
through files), I mean, if one map task has already initialized some
  data
structures (e.g. an array or a list), can other map tasks share these
memorys and directly access them, for I don't want to reinitialize
  these
datas and I want to save some memory. Can hadoop help me do this?
   
You can enable jvm reuse across tasks. See
  mapred.job.reuse.jvm.num.tasks
in mapred-default.xml for usage. Then you can cache the data in a
  static
variable in your mapper.
   
- Sharad
   
  
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals



  

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread jason hadoop
Something is happening inside of your
Parameters.readConfigAndLoadExternalData("Config/allLayer1.config")
code, and the framework is killing the job for not heartbeating for 600
seconds.
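
If the model load legitimately takes that long, one knob to try, as a sketch in
the driver akhil already has (mapred.task.timeout is the stock property behind
that 600-second limit, in milliseconds):

  // Allow up to 30 minutes without progress reports before the task is killed.
  // Default is 600000 (10 minutes); set it next to the other conf.set(...) calls.
  conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);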

On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote:


 One more thing, finally it terminates there (after some time) by giving the
 final Exception:

 java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
 at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 akhil1988 wrote:
 
  Thank you Jason for your reply.
 
  My Map class is an inner class and it is a static class. Here is the
  structure of my code.
 
public class NerTagger {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        private static NETaggerLevel1 tagger1 = new NETaggerLevel1();
        private static NETaggerLevel2 tagger2 = new NETaggerLevel2();

        Map() {
            System.out.println("HI2\n");
            Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
            System.out.println("HI3\n");
            Parameters.forceNewSentenceOnLineBreaks = Boolean.parseBoolean("true");

            System.out.println("loading the tagger");
            tagger1 = (NETaggerLevel1) Classifier.binaryRead(Parameters.pathToModelFile + ".level1");
            System.out.println("HI5\n");
            tagger2 = (NETaggerLevel2) Classifier.binaryRead(Parameters.pathToModelFile + ".level2");
            System.out.println("Done- loading the tagger");
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String inputline = value.toString();

            /* Processing of the input pair is done here */
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NerTagger.class);
        conf.setJobName("NerTagger");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setNumReduceTasks(0);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        conf.set("mapred.job.tracker", "local");
        conf.set("fs.default.name", "file:///");

        DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
        DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
        DistributedCache.createSymlink(conf);

        conf.set("mapred.child.java.opts", "-Xmx4096m");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        System.out.println("HI1\n");

        JobClient.runJob(conf);
    }
 
  Jason, when the program executes HI1 and HI2 are printed but it does not
  reaches HI3. In the statement
  Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); it
 is
  able to access Config/allLayer1.config file (as while executing this
  statement, it prints some messages like which data it is loading, etc.)
  but it gets stuck there(while loading some classifier) and never reaches
  HI3.
 
  This program runs fine when executed normally(without mapreduce).
 
  Thanks, Akhil
 
 
 
 
  jason hadoop wrote:
 
  Is it possible that your map class is an inner class and not static?
 
  On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com
 wrote:
 
 
  Hi All,
 
  I am running my mapred program in local mode by setting
  mapred.jobtracker.local to local mode so that I can debug my code.
  The mapred program is a direct porting of my original sequential code.
  There
  is no reduce phase.
  Basically, I have just put my program in the map class.
 
  My program takes around 1-2 min. in instantiating the data objects
 which
  are
  present in the constructor of Map class(it loads some 

Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-06-16 Thread Richa Khandelwal
Hi,
I faced the same problem. Try deleting the hadoop pids from the logs
directory. That worked for me.

Thanks,
Richa

On Mon, Jun 15, 2009 at 10:28 PM, ashish pareek pareek...@gmail.com wrote:

 HI ,
 I am trying to set up a Hadoop cluster on 3GB machines, using Hadoop
 0.18.3, and have followed the procedure given on the Apache Hadoop site
 for a Hadoop cluster.
 In conf/slaves I have added two datanodes, i.e. including the namenode
 virtual machine and the other virtual machine (datanode), and have
 set up passwordless ssh between both virtual machines. But now the problem
 is that when I run the command:

 bin/hadoop start-all.sh

 it starts only one datanode, on the same namenode virtual machine, but it
 doesn't start the datanode on the other machine.

 In logs/hadoop-datanode.log I get the message:


  INFO org.apache.hadoop.ipc.Client: Retrying
  connect to server: hadoop1/192.168.1.28:9000. Already

  tried 1 time(s).

  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
  connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s).

  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
  connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s).


 .
 .
 .

 I have tried formatting and starting the cluster again, but I still
 get the same error.

 So can any one help in solving this problem. :)

 Thanks

 Regards

 Ashish Pareek




-- 
Richa Khandelwal
University of California,
Santa Cruz
CA


[ANN] HBase 0.20.0-alpha available for download

2009-06-16 Thread stack
An alpha version of HBase 0.20.0 is available for download at:

  http://people.apache.org/~stack/hbase-0.20.0-alpha/

We are making this release available to preview what is coming in HBase
0.20.0.  In short, 0.20.0 is about performance and high-availability.  Also,
a new, richer API has been added and the old deprecated.  Here is a list of
almost 300 issues addressed so far in 0.20.0: http://tinyurl.com/ntvheo

This alpha release contains known bugs.  See http://tinyurl.com/kvfsft for
the current list.  In particular, this alpha release is without a migration
script to bring your 0.19.x era data forward to work on hbase 0.20.0.  A
working, well-tested migration script will be in place before we cut the
first HBase 0.20.0 release candidate some time in the next week or so.

After download, please take the time to review the 0.20.0 'Getting Started'
also available here:
http://people.apache.org/~stack/hbase-0.20.0-alpha/docs/api/overview-summary.html#overview_description.
HBase 0.20.0 has new dependencies, in particular it now depends on
ZooKeeper.  With ZooKeeper in the mix a few core HBase configurations have
been removed and replaced with ZooKeeper configurations instead.

Also of note, HBase 0.20.0 will include Stargate, an improved REST
connector for HBase.  The old, bundled REST connector will be deprecated.
Stargate is implemented using the Jersey framework.  It includes protobuf
encoding support, has caching proxy awareness, supports batching for
scanners and updates, and in general has the goal of enabling Web scale
storage systems (a la S3) backed by HBase.  Currently it's only available up
on github, http://github.com/macdiesel/stargate/tree/master.  It will be
added to a new contrib directory before we cut a release candidate.

Please let us know if you have difficulty with the install, if you find the
documentation missing, or if you trip over bugs while hbasing.

Yours,
The HBasistas
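
For anyone curious what the new, richer client API looks like, here is a rough sketch against the 0.20.0-alpha javadoc; the table, family, and qualifier names below are made up, and details may still shift before the release candidate.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class NewApiSketch {
    public static void main(String[] args) throws Exception {
      // Reads hbase-default.xml/hbase-site.xml from the classpath, including
      // the ZooKeeper settings (e.g. hbase.zookeeper.quorum) mentioned above.
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "testtable");   // table name is illustrative

      // Write one cell: row "row1", family "info", qualifier "greeting".
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("info"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
      table.put(put);

      // Read it back.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("greeting"));
      System.out.println(Bytes.toString(value));
    }
  }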


Re: [ANN] HBase 0.20.0-alpha available for download

2009-06-16 Thread Bradford Stephens
Oh sweet. This will be a most excellent party.

On Tue, Jun 16, 2009 at 10:23 PM, stackst...@duboce.net wrote:
 An alpha version of HBase 0.20.0 is available for download at:

  http://people.apache.org/~stack/hbase-0.20.0-alpha/

 [...]

 Yours,
 The HBasistas



Problem in viewing WEB UI

2009-06-16 Thread ashish pareek
Hi,

  When I run the command bin/hadoop dfsadmin -report, it shows that 2
datanodes are alive, but when I try http://hadoopmaster:50070/ it does not
open the http://hadoopmaster:50070/dfshealth.jsp page and throws an HTTP 404
error. Why is this happening?
Regards,
Ashish Pareek


 On Wed, Jun 17, 2009 at 10:06 AM, Sugandha Neaolekar 
sugandha@gmail.com wrote:

 Well, you just have to specify the address in the URL bar as
 http://hadoopmaster:50070 and you'll be able to see the web UI!


 On Tue, Jun 16, 2009 at 7:17 PM, ashish pareek pareek...@gmail.comwrote:

 HI Sugandha,
            Hmmm, your suggestion helped, and now I am able to run two
 datanodes, one on the same machine as the namenode and the other on a
 different machine. Thanks a lot :)

  But the problem now is that I am not able to see the web UI,
 for both the datanodes as well as the namenode.
 Do I have to configure something more in the site.xml? If so, please
 help...

 Thanking you again,
 regards,
 Ashish Pareek.

 On Tue, Jun 16, 2009 at 3:10 PM, Sugandha Naolekar 
 sugandha@gmail.com wrote:

 Hi!


 First of all, get your Hadoop concepts clear.
 You can refer to the following site:
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)


 I have a small doubt: in the master and slave config files, can we have the
 same port number in both of them, like


 for slave:

 <property>
   <name>fs.default.name</name>
   <value>hdfs://hadoopslave:9000</value>
 </property>


  for master:

 <property>
   <name>fs.default.name</name>
   <value>hdfs://hadoopmaster:9000</value>
 </property>



 Well, any two daemons or services can use the same port number as long as
 they are not run on the same machine. If you wish to run the DN and NN on
 the same machine, their port numbers have to be different.
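
As a plain-Java aside (nothing Hadoop-specific), this is what that clash looks like when two services on one machine try to claim the same port; 9000 is just an example:

  import java.io.IOException;
  import java.net.ServerSocket;

  public class PortClash {
    public static void main(String[] args) throws IOException {
      ServerSocket first = new ServerSocket(9000);   // first daemon grabs the port
      try {
        // A second daemon on the same machine asking for the same port fails.
        ServerSocket second = new ServerSocket(9000);
        second.close();
      } catch (IOException e) {
        System.out.println("Second bind failed as expected: " + e.getMessage());
      } finally {
        first.close();
      }
    }
  }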




 On Tue, Jun 16, 2009 at 2:55 PM, ashish pareek pareek...@gmail.comwrote:

 HI sugandha,



 and one more thing: can we have in the slave:

 <property>
   <name>dfs.datanode.address</name>
   <value>hadoopmaster:9000</value>
   <value>hadoopslave:9001</value>
 </property>



 Also, fs.default.name is the tag which specifies the default filesystem.
 Generally, it runs on the namenode, so its value has to be the namenode's
 address and not the slave's.
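
To make that concrete, here is a minimal client-side sketch (the hostname is just the one used in this thread): every machine's hadoop-site.xml should point fs.default.name at the single namenode, and HDFS clients resolve the filesystem from that value.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DefaultFsCheck {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      // The same value belongs on the master and on every slave: it names
      // the namenode, not the local machine.
      conf.set("fs.default.name", "hdfs://hadoopmaster:9000");

      FileSystem fs = FileSystem.get(conf);      // talks to the namenode
      System.out.println("Default FS: " + fs.getUri());
      System.out.println("Root exists: " + fs.exists(new Path("/")));
      fs.close();
    }
  }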



 Else, if you have a complete procedure for installing and running Hadoop in
 a cluster, can you please send it to me? I need to set up hadoop within
 two days and show it to my guide. Currently I am doing my masters.

 Thanks for spending your time on this.


 Try for the above, and this should work!



 regards,
 Ashish Pareek


 On Tue, Jun 16, 2009 at 2:33 PM, Sugandha Naolekar 
 sugandha@gmail.com wrote:

 The following changes are to be done:

 Under the master folder:

 - put the slave's address as well under the values of the
 dfs.datanode.address tag

 - You want to make the namenode a datanode as well. As per your config
 file, you have specified hadoopmaster in your slaves file. If you don't
 want that, remove it from the slaves file.

 Under the slave folder:

 - put only the slave's address (the machine where you intend to run your
 datanode) under the dfs.datanode.address tag. Else it should go as such:

 <property>
   <name>dfs.datanode.address</name>
   <value>hadoopmaster:9000</value>
   <value>hadoopslave:9001</value>
 </property>

 Also, your port numbers should be different: the daemons NN, DN, JT, and TT
 should run independently on different ports.


 On Tue, Jun 16, 2009 at 2:05 PM, Sugandha Naolekar 
 sugandha@gmail.com wrote:



 -- Forwarded message --
 From: ashish pareek pareek...@gmail.com
 Date: Tue, Jun 16, 2009 at 2:00 PM
 Subject: Re: org.apache.hadoop.ipc.client : trying connect to server
 failed
 To: Sugandha Naolekar sugandha@gmail.com




 On Tue, Jun 16, 2009 at 1:58 PM, ashish pareek 
 pareek...@gmail.comwrote:

 HI,
  I am sending a .tar.gz containing both the master and datanode
 config files ...

 Regards,
 Ashish Pareek


 On Tue, Jun 16, 2009 at 1:47 PM, Sugandha Naolekar 
 sugandha@gmail.com wrote:

 Can you please send me a zip or a tar file? I don't have Windows systems,
 only Linux.


 On Tue, Jun 16, 2009 at 1:19 PM, ashish pareek pareek...@gmail.com
  wrote:

 HI Sugandha,
   Thanks for your reply. I am sending you the
 master and slave configuration files; if you can go through them and tell me
 where I am going wrong, it would be helpful.

 Hope to get a reply soon ... Thanks
 again!

 Regards,
 Ashish Pareek

 On Tue, Jun 16,