Re: Changing the maximum tasks per node on a per job basis

2013-05-23 Thread Harsh J
Your problem seems to revolve around available memory and over-subscription. If
you're using a 0.20.x or 1.x version of Apache Hadoop, you probably want to
use the CapacityScheduler to address this for you.

I once detailed how to do this, in answer to a similar question, here:
http://search-hadoop.com/m/gnFs91yIg1e
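
For reference, a minimal sketch of the kind of memory-aware settings that
approach relies on (Hadoop 1.x property names; the values below are
illustrative assumptions, not recommendations). Cluster-wide slot sizes go
into mapred-site.xml, and a memory-hungry job then requests more than one
slot's worth per task:

<!-- mapred-site.xml: per-slot virtual memory and the per-job ceiling (illustrative values) -->
<property><name>mapred.cluster.map.memory.mb</name><value>1024</value></property>
<property><name>mapred.cluster.reduce.memory.mb</name><value>1024</value></property>
<property><name>mapred.cluster.max.map.memory.mb</name><value>4096</value></property>
<property><name>mapred.cluster.max.reduce.memory.mb</name><value>4096</value></property>

<!-- set per job, e.g. -Dmapred.job.reduce.memory.mb=2048: a 2 GB reducer then occupies two slots -->
<property><name>mapred.job.map.memory.mb</name><value>1024</value></property>
<property><name>mapred.job.reduce.memory.mb</name><value>2048</value></property>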


On Wed, May 22, 2013 at 2:55 PM, Steve Lewis lordjoe2...@gmail.com wrote:

 I have a series of Hadoop jobs to run - one of my jobs requires larger than
 standard memory, so I allow its tasks to use 2GB of memory. When I run some of
 these jobs the slave nodes are crashing because they run out of swap space. It
 is not that a slave cannot run one, or even 4, of these jobs, but 8 stresses
 the limits.
  I could cut mapred.tasktracker.reduce.tasks.maximum for the entire
 cluster, but this cripples the whole cluster for the sake of one of many jobs.
 It seems to be a very bad design
 a) to allow the job tracker to keep assigning tasks to a slave that is
 already getting low on memory
 b) to allow the user to run jobs capable of crashing nodes on the cluster
 c) not to allow the user to specify that some jobs need to be limited to a
 lower value without requiring this limit for every job.

 Are there plans to fix this??

 --




-- 
Harsh J


dncp_block_verification log

2013-05-23 Thread Brahma Reddy Battula
Hi All,



On some systems, I noticed that when the scanner runs, the 
dncp_block_verification.log.curr file under the block pool gets quite large ..





Please let me know:

i) Why is it growing only on some machines?

ii) What is the solution?



The following link also describes the problem:



http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201303.mbox/%3ccajzooycpad5w6cqdteliufy-h9r0pind9f0xelvt2bftwmm...@mail.gmail.com%3E



Thanks



Brahma Reddy


pauses during startup (maybe network related?)

2013-05-23 Thread Ted
Hi I'm running hadoop on my local laptop for development and
everything works but there's some annoying pauses during the startup
which causes the entire hadoop startup process to take up to 4 minutes
and I'm wondering what it is and if I can do anything about it.

I'm running everything on 1 machine, on Fedora Linux, hadoop-1.1.2,
Oracle jdk1.7.0_17; the machine is a dual core i5, and I have 8GB of
RAM and an SSD, so it shouldn't be slow.

When the system pauses, there is no CPU usage, no disk usage and no
network usage (although I suspect it's waiting for the network to
resolve or return something).

Here are some snippets from the namenode logs during startup, where you
can see it just pauses for around 30 seconds or more without errors
or anything:

...
2013-05-23 19:26:37,660 INFO
org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
hadoop-metrics2.properties
2013-05-23 19:26:37,676 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2013-05-23 19:27:54,144 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2013-05-23 19:27:54,144 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
system started
...
2013-05-23 19:27:54,341 WARN
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The
dfs.support.append option is in your configuration, however append is
not supported. This configuration option is no longer required to
enable sync.
2013-05-23 19:27:54,341 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
accessTokenLifetime=0 min(s)
2013-05-23 19:28:19,918 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStateMBean and NameNodeMXBean
2013-05-23 19:28:19,937 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
occuring more than 10 times
...
2013-05-23 19:28:26,801 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 28 on 9000: starting
2013-05-23 19:28:26,833 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 31 on 9000: starting
2013-05-23 19:30:10,644 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* NameSystem.registerDatanode: node registration from
127.0.0.1:50010 storage DS-651015167-192.168.1.5-50010-1369140176513
2013-05-23 19:30:10,650 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/127.0.0.1:50010


I already start the system with : export
HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
I only allocate : export HADOOP_HEAPSIZE=512 (but it's an empty hadoop
system, maybe just 1 or 2 test files less than 100k, and there's no
CPU usage so it doesn't look like it's GC thrashing)

I should mention again, there's no errors and the system runs fine and
relatively speedy once started (considering it's on my laptop).

Does anyone know what's causing these pauses? (and how I can get rid of them)
Thanks.
-- 
Ted.


Hadoop Rack awareness on virtual system

2013-05-23 Thread Jitendra Yadav
Hi,

Can we create and test Hadoop rack awareness functionality on a VirtualBox
system (like on a laptop, etc.)?


Thanks~


Re: dncp_block_verification log

2013-05-23 Thread Harsh J
Hi,

What is your HDFS version? I vaguely remember this to be a problem in the
2.0.0 version or so where there was also a block scanner excessive work
bug, but I'm not sure what fixed it. I've not seen it appear in the later
releases.


On Thu, May 23, 2013 at 12:08 PM, Brahma Reddy Battula 
brahmareddy.batt...@huawei.com wrote:

  Hi All,



 On some systems, I noticed that when the scanner runs, the
 dncp_block_verification.log.curr file under the block pool gets quite large
 ..





 Please let me know..



 i) why it is growing in only some machines..?

 ii) Wht's solution..?



 Following links also will describes the problem




 http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201303.mbox/%3ccajzooycpad5w6cqdteliufy-h9r0pind9f0xelvt2bftwmm...@mail.gmail.com%3E



 Thanks



 Brahma Reddy




-- 
Harsh J


RE: dncp_block_verification log

2013-05-23 Thread Brahma Reddy Battula
Hi Harsh,





Thanks for the reply...



I am using hadoop-2.0.1





From: Harsh J [ha...@cloudera.com]
Sent: Thursday, May 23, 2013 8:24 PM
To: user@hadoop.apache.org
Subject: Re: dncp_block_verification log

Hi,

What is your HDFS version? I vaguely remember this to be a problem in the 2.0.0 
version or so where there was also a block scanner excessive work bug, but I'm 
not sure what fixed it. I've not seen it appear in the later releases.


On Thu, May 23, 2013 at 12:08 PM, Brahma Reddy Battula 
brahmareddy.batt...@huawei.commailto:brahmareddy.batt...@huawei.com wrote:

Hi All,



On some systems, I noticed that when the scanner runs, the 
dncp_block_verification.log.curr file under the block pool gets quite large ..





Please let me know..



i) why it is growing in only some machines..?

ii) Wht's solution..?



Following links also will describes the problem



http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201303.mbox/%3ccajzooycpad5w6cqdteliufy-h9r0pind9f0xelvt2bftwmm...@mail.gmail.com%3E



Thanks



Brahma Reddy



--
Harsh J


Hadoop Installation Mappers setting

2013-05-23 Thread Jitendra Yadav
Hi,

While installing a Hadoop cluster, how can we calculate the right value for
the number of mappers?


Thanks~


Out of memory error by Node Manager, and shut down

2013-05-23 Thread Krishna Kishore Bonagiri
Hi,

  I have got the following error in node manager's log, and it got shut
down, after about 1 application were run after it was started. Any clue
why it occurs... or is this a bug?


2013-05-22 11:53:34,456 FATAL
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[process
reaper,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830,
errno 11
at java.lang.Thread.startImpl(Native Method)
at java.lang.Thread.start(Thread.java:887)
at java.lang.ProcessInputStream.init(UNIXProcess.java:472)
at java.lang.UNIXProcess$1$1$1.run(UNIXProcess.java:157)
at
java.security.AccessController.doPrivileged(AccessController.java:202)
at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:137)


Thanks,
Kishore


Re: Hadoop Installation Mappers setting

2013-05-23 Thread bejoy . hadoop
Hi

I assume the question is on how many slots.

It depends on
- the child/task jvm size and the available memory.
- available number of cores 



Your available memory for tasks is total memory - memory used for OS and other 
services running on your box.

Other services include non hadoop services as well as hadoop daemons.

 Divide the available memory by the child jvm size and that gives the max
number of slots.

 Also check whether a sufficient number of cores is available as well.



Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Jitendra Yadav jeetuyadav200...@gmail.com
Date: Thu, 23 May 2013 18:10:38 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: Hadoop Installation Mappers setting

Hi,

While installing hadoop cluster, how we can calculate the exact number of
mappers value.


Thanks~



Re: Hadoop Rack awareness on virtual system

2013-05-23 Thread Leonid Fedotov
You definitely can.
Just set up a rack-awareness script on your VMs.

Leonid


On Thu, May 23, 2013 at 2:50 AM, Jitendra Yadav
jeetuyadav200...@gmail.com wrote:

 Hi,

 Can we create and test hadoop rack awareness functionality in virtual box
 system(like on laptop .etc)?.


 Thanks~



Hadoop Classpath issue.

2013-05-23 Thread Dhanasekaran Anbalagan
Hi Guys,

When I try to execute the hadoop fs -ls / command,
it returns two extra lines.

226:~# hadoop fs -ls /
common ./
lib lib
Found 9 items
drwxrwxrwx   - hdfs   supergroup  0 2013-03-07 04:46 /benchmarks
drwxr-xr-x   - hbase  hbase   0 2013-05-23 08:59 /hbase
drwxr-xr-x   - hdfs   supergroup  0 2013-02-20 13:21 /mapred
drwxr-xr-x   - tech   supergroup  0 2013-05-03 05:15 /test
drwxrwxrwx   - mapred supergroup  0 2013-05-23 09:33 /tmp
drwxrwxr-x   - hdfs   supergroup  0 2013-02-20 16:32 /user
drwxr-xr-x   - hdfs   supergroup  0 2013-02-20 15:10 /var


On other machines it does not return the extra two lines. Please guide me on
how to remove these lines.

226:~# /usr/bin/hadoop classpath
common ./
lib lib
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*


Please guide me on how to fix this.

-Dhanasekaran
Did I learn something today? If not, I wasted it.


Re: Hadoop Rack awareness on virtual system

2013-05-23 Thread Jitendra Yadav
Hi Leonid,

Thanks for your reply.

Could you please give me an example of how to make a topology.sh file?

Let's say I have the below slave servers (data nodes):

192.168.45.1  dnode1
 192.168.45.2  dnode2
 192.168.45.3  dnode3
 192.168.45.4  dnode4
 192.168.45.5  dnode5

Thanks

 On Thu, May 23, 2013 at 8:02 PM, Leonid Fedotov lfedo...@hortonworks.com wrote:

 You definitely can.
 Just set rack script on your VMs.

 Leonid


 On Thu, May 23, 2013 at 2:50 AM, Jitendra Yadav 
 jeetuyadav200...@gmail.com wrote:

 Hi,

 Can we create and test hadoop rack awareness functionality in virtual box
 system(like on laptop .etc)?.


 Thanks~





Re: Hadoop Rack awareness on virtual system

2013-05-23 Thread Harsh J
An example topology file and script is available on the Wiki at
http://wiki.apache.org/hadoop/topology_rack_awareness_scripts
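
To make that concrete for the nodes listed above, here is a minimal sketch of
such a script, assuming the five datanodes are split across two hypothetical
racks (the rack assignment and the script path are assumptions). Hadoop calls
the script with one or more IPs/hostnames as arguments and expects one rack
path per argument on stdout; it is wired in through the
topology.script.file.name property in core-site.xml:

#!/bin/bash
# topology.sh (sketch): print one rack path per argument passed by Hadoop.
while [ $# -gt 0 ]; do
  case "$1" in
    192.168.45.1|192.168.45.2|192.168.45.3) echo -n "/rack1 " ;;
    192.168.45.4|192.168.45.5)              echo -n "/rack2 " ;;
    *)                                      echo -n "/default-rack " ;;
  esac
  shift
done
echo

<!-- core-site.xml (path to the script above is an assumed example) -->
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>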


On Thu, May 23, 2013 at 8:38 PM, Jitendra Yadav
jeetuyadav200...@gmail.com wrote:

 Hi Leonid,

 Thanks for you reply.

 please you please give me an example how to make topology.sh file?

 Lets say I have below slave servers(data nodes)

 192.168.45.1  dnode1
  192.168.45.2  dnode2
  192.168.45.3  dnode3
  192.168.45.4  dnode4
  192.168.45.5  dnode5

 Thanks

 On Thu, May 23, 2013 at 8:02 PM, Leonid Fedotov 
 lfedo...@hortonworks.comwrote:

 You definitely can.
 Just set rack script on your VMs.

 Leonid


 On Thu, May 23, 2013 at 2:50 AM, Jitendra Yadav 
 jeetuyadav200...@gmail.com wrote:

 Hi,

 Can we create and test hadoop rack awareness functionality in virtual
 box system(like on laptop .etc)?.


 Thanks~






-- 
Harsh J


Re: Out of memory error by Node Manager, and shut down

2013-05-23 Thread Pramod N
Looks like the problem is with the JVM heap size. It's trying to create a new
thread, and threads require native memory for internal JVM related things.

One of the possible solutions is to reduce the Java heap size (to increase free
native memory). Is there any other information about the memory status
(malloc debug information etc.) on the NM? That would give more information
about the NodeManager's memory status.
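
(For illustration, a hedged sketch of where that heap setting is usually
trimmed, assuming the stock yarn-env.sh hooks; the value is a made-up example,
not a recommendation:)

# yarn-env.sh (sketch)
export YARN_HEAPSIZE=512   # daemon heap in MB; a smaller Java heap leaves more native memory for threads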

Hope this helps.

Pramod N
Bruce Wayne of web
@machinelearner https://twitter.com/machinelearner


On Thu, May 23, 2013 at 6:42 PM, Krishna Kishore Bonagiri 
write2kish...@gmail.com wrote:

 Hi,

   I have got the following error in node manager's log, and it got shut
 down, after about 1 application were run after it was started. Any clue
 why does it occur... or is this a bug?


 2013-05-22 11:53:34,456 FATAL
 org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[process
 reaper,5,main] threw an Error.  Shutting down now...
 java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830,
 errno 11
 at java.lang.Thread.startImpl(Native Method)
 at java.lang.Thread.start(Thread.java:887)
 at java.lang.ProcessInputStream.init(UNIXProcess.java:472)
 at java.lang.UNIXProcess$1$1$1.run(UNIXProcess.java:157)
 at
 java.security.AccessController.doPrivileged(AccessController.java:202)
 at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:137)


 Thanks,
 Kishore



Re: R for Hadoop

2013-05-23 Thread Amal G Jose
Try Rhipe, it is good.
http://amalgjose.wordpress.com/2013/05/05/rhipe-installation/

http://www.datadr.org/

http://amalgjose.wordpress.com/2013/05/05/r-installation-in-linux-platforms/


On Mon, May 20, 2013 at 2:23 PM, sudhakara st sudhakara...@gmail.com wrote:

 Hi
 You can find good start-up material for RHadoop here:

 https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md

 http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/

 I am also working on RHadoop; mail me if you have any difficulty with RHadoop.


 On Mon, May 20, 2013 at 12:20 AM, Marco Shaw marco.s...@gmail.com wrote:

 You can try to search for Rhadoop using your favourite search engine.

 I think you are going to have to put in a bit more effort on your own.

 Marco





 --

 Regards,
 ...Sudhakara.st




RE: Shuffle phase replication factor

2013-05-23 Thread John Lilley
Ling,
Thanks for the response!  I could use more clarification on item 1.  
Specifically

* mapred.reduce.parallel.copies  limits the number of outbound 
connections for a reducer, but not the inbound connections for a mapper.  Does 
tasktracker.http.threads limit the number of simultaneous inbound connections 
for a mapper, or only the size of the thread pool servicing the connections?  
(i.e. is it one thread per inbound connection?).

* Who actually creates the listen port for serving up the mapper files? 
 The mapper task?  Or something more persistent in MapReduce?
Thanks,
John

From: erlv5...@gmail.com [mailto:erlv5...@gmail.com] On Behalf Of Kun Ling
Sent: Wednesday, May 22, 2013 7:50 PM
To: user
Subject: Re: Shuffle phase replication factor

Hi John,


   1. For the number of simultaneous connection limitations: you can configure
this using the mapred.reduce.parallel.copies flag. The default is 5.

   2. As for the implication of aggressively disconnecting, I am afraid it is
only a little. Normally, each reducer will connect to each mapper task and ask
for its partitions of the map output file. Because there are only about 5
simultaneous connections fetching the map output for each reducer, for a large
MR cluster with 1000 nodes and a huge MR job with 1000 mappers and 1000
reducers, each node sees only about 5 connections. So the impact is only a
little.


  3. What happens to the pending/failing connection: the short answer is
just try to reconnect. There is a List which maintains all the outputs of
the Mapper that need to be copied, and an element will be removed only if the
map output is successfully copied. A forever loop will keep looking into the
List and fetching the corresponding map output.


  All the above answers are based on the Hadoop 1.0.4 source code, especially
the ReduceTask.java file.

yours,
Ling Kun
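
(For reference, a minimal sketch of where the two knobs discussed above live
in mapred-site.xml, using the MR1 defaults mentioned in this thread:)

<property><name>mapred.reduce.parallel.copies</name><value>5</value></property>
<property><name>tasktracker.http.threads</name><value>40</value></property>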

On Wed, May 22, 2013 at 10:57 PM, John Lilley 
john.lil...@redpoint.net wrote:
U, is that also the limit for the number of simultaneous connections?  In 
general, one does not need a 1:1 map between threads and connections.
If this is the connection limit, does it imply  that the client or server side 
aggressively disconnects after a transfer?
What happens to the pending/failing connection attempts that exceed the limit?
Thanks!
john

From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
Sent: Wednesday, May 22, 2013 8:52 AM

To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

There are properties/configuration to control the no. of copying threads for 
copy.
tasktracker.http.threads=40
Thanks,
Rahul

On Wed, May 22, 2013 at 8:16 PM, John Lilley 
john.lil...@redpoint.net wrote:
This brings up another nagging question I've had for some time.  Between HDFS 
and shuffle, there seems to be the potential for every node connecting to 
every other node via TCP.  Are there explicit mechanisms in place to manage or 
limit simultaneous connections?  Is the protocol simply robust enough to allow 
a server-side to disconnect at any time to free up slots and the client-side 
will retry the request?
Thanks
john

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday, May 22, 2013 8:38 AM

To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

As mentioned by Bertrand, "Hadoop: The Definitive Guide" is, well... a really
definitive :) place to start. It is pretty thorough for starters, and once you
have gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley 
john.lil...@redpoint.net wrote:
Oh I see.  Does this mean there is another service and TCP listen port for this 
purpose?
Thanks for your indulgence... I would really like to read more about this 
without bothering the group but not sure where to start to learn these 
internals other than the code.
john

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to 
its local disk, the reduce tasks will pull the data through HTTP for further 
processing.

On 21.05.2013, at 19:57, John Lilley john.lil...@redpoint.net wrote:

When MapReduce enters shuffle to partition the tuples, I am assuming that it 
writes intermediate data to HDFS.  What replication factor is used for those 
temporary files?
john


--
Kai Voigt
k...@123.org








--
http://www.lingcc.com


Re: Shuffle phase replication factor

2013-05-23 Thread Sandy Ryza
In MR1, the tasktracker serves the mapper files (so that tasks don't have
to stick around taking up resources).  In MR2, the shuffle service, which
lives inside the nodemanager, serves them.

-Sandy
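
For context, a minimal sketch of how that shuffle service is wired into the
NodeManager via yarn-site.xml (property values as commonly used around the
Hadoop 2.0.x line, where the service was still named mapreduce.shuffle; treat
this as an assumption rather than the exact config for every release):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>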


On Thu, May 23, 2013 at 10:22 AM, John Lilley john.lil...@redpoint.net wrote:

  Ling,

 Thanks for the response!  I could use more clarification on item 1.
 Specifically

 * mapred.reduce.parallel.copies limits the number of outbound connections
 for a reducer, but not the inbound connections for a mapper.  Does
 tasktracker.http.threads limit the number of simultaneous inbound
 connections for a mapper, or only the size of the thread pool servicing the
 connections?  (i.e. is it one thread per inbound connection?)

 * Who actually creates the listen port for serving up the mapper files?
 The mapper task?  Or something more persistent in MapReduce?

 Thanks,

 John

 From: erlv5...@gmail.com [mailto:erlv5...@gmail.com] On Behalf Of Kun Ling
 Sent: Wednesday, May 22, 2013 7:50 PM
 To: user
 Subject: Re: Shuffle phase replication factor

 Hi John,

    1. For the number of simultaneous connection limitations: you can
 configure this using the mapred.reduce.parallel.copies flag. The default
 is 5.

    2. As for the implication of aggressively disconnecting, it is only a
 little. Normally, each reducer will connect to each mapper task and ask
 for its partitions of the map output file. Because there are only about 5
 simultaneous connections fetching the map output for each reducer, for a
 large MR cluster with 1000 nodes and a huge MR job with 1000 mappers and
 1000 reducers, each node sees only about 5 connections. So the impact is
 only a little.

   3. What happens to the pending/failing connection: the short answer
 is just try to reconnect. There is a List which maintains all the
 outputs of the Mapper that need to be copied, and an element is removed
 only if the map output is successfully copied. A forever loop keeps
 looking into the List and fetching the corresponding map output.

   All the above answers are based on the Hadoop 1.0.4 source code,
 especially the ReduceTask.java file.

 yours,

 Ling Kun

 On Wed, May 22, 2013 at 10:57 PM, John Lilley john.lil...@redpoint.net
 wrote:

 U, is that also the limit for the number of simultaneous connections?
 In general, one does not need a 1:1 map between threads and connections.

 If this is the connection limit, does it imply that the client or server
 side aggressively disconnects after a transfer?

 What happens to the pending/failing connection attempts that exceed the
 limit?

 Thanks!

 john

 From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
 Sent: Wednesday, May 22, 2013 8:52 AM

 To: user@hadoop.apache.org
 Subject: Re: Shuffle phase replication factor

 There are properties/configuration to control the no. of copying threads
 for copy.
 tasktracker.http.threads=40
 Thanks,
 Rahul

 On Wed, May 22, 2013 at 8:16 PM, John Lilley john.lil...@redpoint.net
 wrote:

 This brings up another nagging question I've had for some time.  Between
 HDFS and shuffle, there seems to be the potential for "every node
 connecting to every other node" via TCP.  Are there explicit mechanisms in
 place to manage or limit simultaneous connections?  Is the protocol simply
 robust enough to allow a server-side to disconnect at any time to free up
 slots and the client-side will retry the request?

 Thanks

 john

 From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
 Sent: Wednesday, May 22, 2013 8:38 AM

 To: user@hadoop.apache.org
 Subject: Re: Shuffle phase replication factor

 As mentioned by Bertrand, "Hadoop: The Definitive Guide" is, well... a
 really definitive :) place to start. It is pretty thorough for starters,
 and once you have gone through it, the code will start making more sense
 too.

 Regards,

 Shahab

 On Wed, May 22, 2013 at 10:33 AM, John Lilley john.lil...@redpoint.net
 wrote:

 Oh I see.  Does this mean there is another service and TCP listen port for
 this purpose?

 Thanks for your indulgence... I would really like to read more about this
 without bothering the group but not sure where to start to learn these
 internals other than the code.

 john

 From: Kai Voigt [mailto:k...@123.org]
 Sent: Tuesday, May 21, 2013 12:59 PM
 To: user@hadoop.apache.org
 Subject: Re: Shuffle phase replication factor

 The map output doesn't get written to HDFS. The map task writes its output
 to its local disk, the reduce tasks will pull the data through HTTP for
 further processing.

 On 21.05.2013, at 19:57, John Lilley john.lil...@redpoint.net wrote:


Re: Is there a way to limit # of hadoop tasks per user at runtime?

2013-05-23 Thread Amal G Jose
You can use capacity scheduler also. In that you can create some queues,
each of specific capacity. Then you can submit jobs to that specific queue
at runtime or you can configure it as direct submission.
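
For illustration, a minimal sketch of defining such queues under the MR1
CapacityScheduler (the queue names and capacities here are assumptions):

<!-- mapred-site.xml -->
<property><name>mapred.queue.names</name><value>default,highmem</value></property>

<!-- capacity-scheduler.xml -->
<property><name>mapred.capacity-scheduler.queue.default.capacity</name><value>70</value></property>
<property><name>mapred.capacity-scheduler.queue.highmem.capacity</name><value>30</value></property>

A job then picks its queue at submission time, e.g. -Dmapred.job.queue.name=highmem.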


On Wed, May 22, 2013 at 3:27 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Mehmet,

 Are you using MR1 or MR2?

 The fair scheduler, present in both versions, but configured slightly
 differently, allows you to limit the number of map and reduce tasks in a
 queue.  The configuration can be updated at runtime by modifying the
 scheduler's allocations file.  It also has a feature that automatically
 maps jobs to queues based on the user submitted them.

 Here are links to documentation in MR1 and MR2:
 http://hadoop.apache.org/docs/stable/fair_scheduler.html

 http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html

 -Sandy



 On Tue, May 21, 2013 at 2:43 PM, Mehmet Belgin 
 mehmet.bel...@oit.gatech.edu wrote:

 Hi Everyone,

 I was wondering if there is a way for limiting the number of tasks
 (map+reduce) *per user* at runtime? Using an environment variable perhaps?
 I am asking this from a resource provisioning perspective. I am trying to
 come up with a N-token licensing system for multiple users to use our
 limited hadoop resources simultaneously. That is, when user A checks out 6
 tokens,  he/she can only run 6 hadoop tasks.

 If there is no such thing in hadoop, has anyone tried to integrate hadoop
 with torque/moab (or any other RM or scheduler)? Any advice in that
 direction will be appreciated :)

 Thanks in advance,
 -Mehmet










Re: Hadoop Installation Mappers setting

2013-05-23 Thread Amal G Jose
Let me explain it a bit more.
Say your machine has 8 GB of memory.
After reserving memory for the operating system and all other processes except
the tasktracker, you have 4 GB remaining (assume).
The remaining process running is the tasktracker.
If the child jvm size is 200 MB,
then you can define a maximum number of slots of 4*1024 MB / 200 MB,
which is approximately 20.
You can divide the slots into mapper and reducer slots as per your
requirement.
This is just an example that I explained based on my knowledge.
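
For illustration, carrying that example into mapred-site.xml might look like
the sketch below (the 14/6 split and the 200 MB child heap are assumptions,
not recommendations):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>14</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>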



On Thu, May 23, 2013 at 7:48 PM, bejoy.had...@gmail.com wrote:

 Hi

 I assume the question is on how many slots.

 It dependents on
 - the child/task jvm size and the available memory.
 - available number of cores



 Your available memory for tasks is total memory - memory used for OS and
 other services running on your box.

 Other services include non hadoop services as well as hadoop daemons.

 Divide the available memory with child jvm size and that would get the max
 num of slots.

 Also check whether sufficient number of cores are available as well.


 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 From: Jitendra Yadav jeetuyadav200...@gmail.com
 Date: Thu, 23 May 2013 18:10:38 +0530
 To: user@hadoop.apache.org
 Reply-To: user@hadoop.apache.org
 Subject: Hadoop Installation Mappers setting

 Hi,

 While installing hadoop cluster, how we can calculate the exact number of
 mappers value.


 Thanks~



HDFS data and non-aligned splits

2013-05-23 Thread John Lilley
What happens when MR produces data splits, and those splits don't align on 
block boundaries?  I've read that MR will attempt to make data splits near 
block boundaries to improve data locality, but isn't there always some slop 
where records straddle the block boundaries, resulting in an extra HDFS 
connection just to get the half-record in the other block?  Does this impact 
performance?  Are there file formats that attempt to enforce data alignment?



SequenceFile sync marker uniqueness

2013-05-23 Thread John Lilley
How does SequenceFile guarantee that the sync marker does not appear in the 
data?
John



Re: pauses during startup (maybe network related?)

2013-05-23 Thread Chris Nauroth
Hi Ted,

2013-05-23 19:28:19,937 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
occuring more than 10 times
...
2013-05-23 19:28:26,801 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 28 on 9000: starting

There are a couple of relevant activities that happen during namenode
startup in between these 2 log statements.  It loads the current fsimage
(persistent copy of file system metadata), merges in the edits log
(transaction log containing all file system metadata changes since the last
checkpoint), and then saves back a new fsimage file after that merge.
 Current versions of the Hadoop codebase will print some information to
logs about the volume of activity during this checkpointing process, so I
recommend looking for that in your logs to see if this explains it.
 Depending on whether or not you have a large number of transactions
queued since your last checkpoint, this whole process can cause namenode
startup to take several minutes.

If this becomes a regular problem, then you can run SecondaryNameNode or
BackupNode to perform periodic checkpoints in addition to the checkpoint
that occurs on namenode restart.  This is probably overkill for a dev
environment on your laptop though.
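
(If you do go that route, a minimal sketch of the classic Hadoop 1.x
checkpoint knobs and the command to start the checkpoint daemon; the values
shown are the usual defaults and are illustrative only:)

<!-- core-site.xml -->
<property><name>fs.checkpoint.period</name><value>3600</value></property>    <!-- seconds between checkpoints -->
<property><name>fs.checkpoint.size</name><value>67108864</value></property>  <!-- or when edits reach ~64 MB -->

# start the checkpoint daemon on the chosen node (Hadoop 1.x layout assumed):
bin/hadoop-daemon.sh start secondarynamenode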

Hope this helps,

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Thu, May 23, 2013 at 2:49 AM, Ted r6squee...@gmail.com wrote:

 Hi I'm running hadoop on my local laptop for development and
 everything works but there's some annoying pauses during the startup
 which causes the entire hadoop startup process to take up to 4 minutes
 and I'm wondering what it is and if I can do anything about it.

 I'm running everything on 1 machines, on fedora linux, hadoop-1.1.2,
 oracle jkd1.7.0_17, the machine is a dual core i5, and I have 8gb of
 ram and an SSD so it shouldn't be slow.

 When the system pauses, there is no cpu usage, no disk usage and no
 network usage (although I suspect it's waiting for the network to
 resolve or return something).

 Here's some snippets from the namenode logs during startup where you
 can see it just pauses for around 30 seconds or more with out errors
 or anything :

 ...
 2013-05-23 19:26:37,660 INFO
 org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
 hadoop-metrics2.properties
 2013-05-23 19:26:37,676 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
 MetricsSystem,sub=Stats registered.
 2013-05-23 19:27:54,144 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
 period at 10 second(s).
 2013-05-23 19:27:54,144 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
 system started
 ...
 2013-05-23 19:27:54,341 WARN
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The
 dfs.support.append option is in your configuration, however append is
 not supported. This configuration option is no longer required to
 enable sync.
 2013-05-23 19:27:54,341 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
 accessTokenLifetime=0 min(s)
 2013-05-23 19:28:19,918 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStateMBean and NameNodeMXBean
 2013-05-23 19:28:19,937 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
 occuring more than 10 times
 ...
 2013-05-23 19:28:26,801 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 28 on 9000: starting
 2013-05-23 19:28:26,833 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 31 on 9000: starting
 2013-05-23 19:30:10,644 INFO org.apache.hadoop.hdfs.StateChange:
 BLOCK* NameSystem.registerDatanode: node registration from
 127.0.0.1:50010 storage DS-651015167-192.168.1.5-50010-1369140176513
 2013-05-23 19:30:10,650 INFO org.apache.hadoop.net.NetworkTopology:
 Adding a new node: /default-rack/127.0.0.1:50010


 I already start the system with : export
 HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
 I only allocate : export HADOOP_HEAPSIZE=512 (but it's an empty hadoop
 system, maybe just 1 or 2 test files less than 100k, and there's no
 CPU usage so it doesn't look like it's GC thrashing)

 I should mention again, there's no errors and the system runs fine and
 relatively speedy once started (considering it's on my laptop).

 Does anyone know what's causing these pauses? (and how I can get rid of
 them)
 Thanks.
 --
 Ted.



Re: Is there a way to limit # of hadoop tasks per user at runtime?

2013-05-23 Thread Harsh J
The only pain point I'd find with CS in a multi-user environment is its
limitation of using queue configs. It's non-trivial to configure a queue per
user as CS doesn't provide any user level settings (it wasn't designed for
that initially), while in FS you get user level limiting settings for
free, while also being able to specify pools (for users, or generally for a
property, such as queues).
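
For illustration, a minimal sketch of an MR1 fair-scheduler allocations file
of the kind described above (pool/user names and limits are assumptions):

<?xml version="1.0"?>
<allocations>
  <pool name="analytics">
    <maxMaps>6</maxMaps>          <!-- cap on concurrent map tasks in this pool -->
    <maxReduces>6</maxReduces>    <!-- cap on concurrent reduce tasks in this pool -->
  </pool>
  <user name="userA">
    <maxRunningJobs>2</maxRunningJobs>
  </user>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>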


On Thu, May 23, 2013 at 10:55 PM, Amal G Jose amalg...@gmail.com wrote:

 You can use capacity scheduler also. In that you can create some queues,
 each of specific capacity. Then you can submit jobs to that specific queue
 at runtime or you can configure it as direct submission.


 On Wed, May 22, 2013 at 3:27 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Mehmet,

 Are you using MR1 or MR2?

 The fair scheduler, present in both versions, but configured slightly
 differently, allows you to limit the number of map and reduce tasks in a
 queue.  The configuration can be updated at runtime by modifying the
 scheduler's allocations file.  It also has a feature that automatically
 maps jobs to queues based on the user submitted them.

 Here are links to documentation in MR1 and MR2:
 http://hadoop.apache.org/docs/stable/fair_scheduler.html

 http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html

 -Sandy



 On Tue, May 21, 2013 at 2:43 PM, Mehmet Belgin 
 mehmet.bel...@oit.gatech.edu wrote:

 Hi Everyone,

 I was wondering if there is a way for limiting the number of tasks
 (map+reduce) *per user* at runtime? Using an environment variable perhaps?
 I am asking this from a resource provisioning perspective. I am trying to
 come up with a N-token licensing system for multiple users to use our
 limited hadoop resources simultaneously. That is, when user A checks out 6
 tokens,  he/she can only run 6 hadoop tasks.

 If there is no such thing in hadoop, has anyone tried to integrate
 hadoop with torque/moab (or any other RM or scheduler)? Any advice in that
 direction will be appreciated :)

 Thanks in advance,
 -Mehmet











-- 
Harsh J


Re: Hadoop Installation Mappers setting

2013-05-23 Thread Jitendra Yadav
Hi,

Thanks for your clarification.

I have one more question.

How does the number of cores factor into the slots calculation?

Thanks~

On 5/23/13, Amal G Jose amalg...@gmail.com wrote:
 I am explaining it more.
 If your machine have 8 GB of memory.
 After reserving to Operating system and all other processes except
 tasktracker, you have 4 GB remaining(assume).
 The remaining process running is tasktracker.
 If the child jvm size is 200 MB,
 Then you can define a maximum slots of 4*1024 MB/ 200 MB
 Which is approximately 20.
 You can divide the slots into mapper and reducer slots as per your
 requirement.
 This is just an example that I explained based on my knowledge.



 On Thu, May 23, 2013 at 7:48 PM, bejoy.had...@gmail.com wrote:

 Hi

 I assume the question is on how many slots.

 It dependents on
 - the child/task jvm size and the available memory.
 - available number of cores



 Your available memory for tasks is total memory - memory used for OS and
 other services running on your box.

 Other services include non hadoop services as well as hadoop daemons.

 Divide the available memory with child jvm size and that would get the
 max
 num of slots.

 Also check whether sufficient number of cores are available as well.


 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 From: Jitendra Yadav jeetuyadav200...@gmail.com
 Date: Thu, 23 May 2013 18:10:38 +0530
 To: user@hadoop.apache.org
 Reply-To: user@hadoop.apache.org
 Subject: Hadoop Installation Mappers setting

 Hi,

 While installing hadoop cluster, how we can calculate the exact number of
 mappers value.


 Thanks~




Re: HDFS data and non-aligned splits

2013-05-23 Thread Harsh J
 What happens when MR produces data splits, and those splits don’t align
on block boundaries?

Answer depends on the file format used here. With any of the formats we
ship, nothing happens.

 but isn’t there always some slop where records straddle the block
boundaries, resulting in an extra HDFS connection just to get the
half-record in the other block?

Yes, but how large is half (or in worst case, the whole) record going to be
in size?

 Does this impact performance?

It's more of an extra, minor DN connection. The perf impact is almost zero
but the format-free loading is a major win in operations. Compared to
Disco's DDFS, for one alternative example, HDFS is much easier here. With
Disco you have to manage your chunking during load time, while with HDFS,
MR libraries need logic based on
http://wiki.apache.org/hadoop/HadoopMapReduce to process those records. You
would at most, depending on how large the records are of course, spend
reading from a few bytes to a few megabytes over the network. If you use
large record sizes, it's also a good idea to raise the file's block size.
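
For illustration, the block size can be raised cluster-wide or per job; a
hedged sketch using the Hadoop 1.x property name (the 256 MB value is an
arbitrary example, not a recommendation):

<property>
  <name>dfs.block.size</name>
  <value>268435456</value>  <!-- 256 MB; can also be passed per job, e.g. -Ddfs.block.size=268435456 -->
</property>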

 Are there file formats that attempt to enforce data alignment?

I don't think there are any, and there shouldn't be, because reading records
beyond split boundaries is pretty transparent to application writers. Your
HDFS reader API doesn't require you to be aware of the split.


On Thu, May 23, 2013 at 11:23 PM, John Lilley john.lil...@redpoint.net wrote:

  What happens when MR produces data splits, and those splits don’t align
 on block boundaries?  I’ve read that MR will attempt to make data splits
 near block boundaries to improve data locality, but isn’t there always some
 slop where records straddle the block boundaries, resulting in an extra
 HDFS connection just to get the half-record in the other block?  Does this
 impact performance?  Are there file formats that attempt to enforce data
 alignment?





-- 
Harsh J


Re: Hadoop Installation Mappers setting

2013-05-23 Thread bejoy . hadoop
When you take a mapreduce tasks, you need CPU cycles to do the processing, not 
just memory.

So ideally based on the processor type(hyperthreaded or not) compute the 
available cores. Then may be compute as, one core for each task slot.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Jitendra Yadav jeetuyadav200...@gmail.com
Date: Fri, 24 May 2013 00:26:29 
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: Re: Hadoop Installation Mappers setting

Hi,

Thanks for your clarification.

I have one more question.

How does cores factor influence slots calculation?

Thanks~

On 5/23/13, Amal G Jose amalg...@gmail.com wrote:
 I am explaining it more.
 If your machine have 8 GB of memory.
 After reserving to Operating system and all other processes except
 tasktracker, you have 4 GB remaining(assume).
 The remaining process running is tasktracker.
 If the child jvm size is 200 MB,
 Then you can define a maximum slots of 4*1024 MB/ 200 MB
 Which is approximately 20.
 You can divide the slots into mapper and reducer slots as per your
 requirement.
 This is just an example that I explained based on my knowledge.



 On Thu, May 23, 2013 at 7:48 PM, bejoy.had...@gmail.com wrote:

 Hi

 I assume the question is on how many slots.

 It dependents on
 - the child/task jvm size and the available memory.
 - available number of cores



 Your available memory for tasks is total memory - memory used for OS and
 other services running on your box.

 Other services include non hadoop services as well as hadoop daemons.

 Divide the available memory with child jvm size and that would get the
 max
 num of slots.

 Also check whether sufficient number of cores are available as well.


 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 From: Jitendra Yadav jeetuyadav200...@gmail.com
 Date: Thu, 23 May 2013 18:10:38 +0530
 To: user@hadoop.apache.org
 Reply-To: user@hadoop.apache.org
 Subject: Hadoop Installation Mappers setting

 Hi,

 While installing hadoop cluster, how we can calculate the exact number of
 mappers value.


 Thanks~




Re: SequenceFile sync marker uniqueness

2013-05-23 Thread Harsh J
SequenceFiles use a 16-byte MD5 hash (computed from a UID and the writer's
~init time, so pretty random). For the rest of my answer, I'll prefer not to
repeat what Martin's already said very well here:
http://search-hadoop.com/m/VYVra2krg5t1 (point #2) over on the Avro lists, for
the Avro DataFile format which uses a similar technique.


On Thu, May 23, 2013 at 11:34 PM, John Lilley john.lil...@redpoint.net wrote:

  How does SequenceFile guarantee that the sync marker does not appear in
 the data?

 John





-- 
Harsh J


HTTP file server, map output, and other files

2013-05-23 Thread John Lilley
Thanks to previous kind answers and more reading in the elephant book, I now 
understand that mapper tasks place partitioned results into local files that 
are served up to reducers via HTTP:

The output file's partitions are made available to the reducers over HTTP. The 
maximum number of worker threads used to serve the file partitions is 
controlled by the tasktracker.http.threads property; this setting is per 
tasktracker, not per map task slot. The default of 40 may need to be increased 
for large clusters running large jobs. In MapReduce 2, this property is not 
applicable because the maximum number of threads used is set automatically 
based on the number of processors on the machine. (MapReduce 2 uses Netty, 
which by default allows up to twice as many threads as there are processors.)

My question is, for a custom (non-MR) application under YARN, how would I set 
up my application tasks' output data to be served over HTTP?  Is there an API 
to control this, or are there predefined local folders that will be served up?  
Once I am finished with the temporary data, how do I request that the files are 
removed?

Thanks
John



Re: Hive tmp logs

2013-05-23 Thread Sanjay Subramanian
Clarification
This property defines a location on HDFS:

<property>
  <name>hive.exec.scratchdir</name>
  <value>/data01/workspace/hive/scratch/dir/on/local/linux/disk</value>
</property>




From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Date: Wednesday, May 22, 2013 12:23 PM
To: u...@hive.apache.org
Cc: User user@hadoop.apache.org
Subject: Re: Hive tmp logs

<property>
  <name>hive.querylog.location</name>
  <value>/path/to/hivetmp/dir/on/local/linux/disk</value>
</property>


From: Anurag Tangri tangri.anu...@gmail.com
Reply-To: u...@hive.apache.org
Date: Wednesday, May 22, 2013 11:56 AM
To: u...@hive.apache.org
Cc: Hive u...@hive.apache.org, User user@hadoop.apache.org
Subject: Re: Hive tmp logs

Hi,
You can add the Hive query log property to your hive-site.xml and point it to
the directory you want.

Thanks,
Anurag Tangri

Sent from my iPhone

On May 22, 2013, at 11:53 AM, Raj Hadoop 
hadoop...@yahoo.com wrote:

Hi,

My hive job logs are being written to the /tmp/hadoop directory. I want to
change it to a different location, i.e. a sub directory somewhere under the
'hadoop' user home directory.
How do I change it?

Thanks,
Ra



hive.log

2013-05-23 Thread Sanjay Subramanian
How do I set the property in hive-site.xml that defines the local linux 
directory for hive.log ?
Thanks
sanjay



MiniDFS Cluster log dir

2013-05-23 Thread siddhi mehta
Hey guys,

For testing purpose I am starting up a minicluster using the
http://hadoop.apache.org/docs/r1.2.0/cli_minicluster.html

I was wondering what is a good way to configure log directory for the same.
I tried setting hadoop.log.dir or yarn.log.dir but that seems to have no
effect.

I am specifically trying to access job logs.
While trying to access job logs from the job history server pages it
complains that

Logs not available for attempt_136934344_0001_r_00_0. Aggregation
may not be complete, Check back later or try the nodemanager at
localhost:62025

I do set yarn.nodemanager.remote-app-log-dir using the -D option while
starting up the hadoop cluster, but it seems like it does not make use of
that at all.

Any pointers to help resolve the issue would be appreciated.

Regards,
Siddhi Mehta


Re: hive.log

2013-05-23 Thread Sanjay Subramanian
Ok figured it out

-  vi  /etc/hive/conf/hive-log4j.properties

- Modify this line
#hive.log.dir=/tmp/${user.name}
hive.log.dir=/data01/workspace/hive/log/${user.name}


From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
Reply-To: u...@hive.apache.org
Date: Thursday, May 23, 2013 2:56 PM
To: u...@hive.apache.org
Cc: User user@hadoop.apache.org
Subject: hive.log

How do I set the property in hive-site.xml that defines the local linux 
directory for hive.log ?
Thanks
sanjay



Child Error

2013-05-23 Thread Jim Twensky
Hello, I have a 20 node Hadoop cluster where each node has 8GB memory and
an 8-core processor. I sometimes get the following error on a random basis:


---

Exception in thread main java.io.IOException: Exception reading
file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
at 
org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
at 
org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
at org.apache.hadoop.mapred.Child.main(Child.java:92)
Caused by: java.io.IOException: failure to login
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1519)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
... 2 more
Caused by: javax.security.auth.login.LoginException:
java.lang.NullPointerException: invalid null input: name
at com.sun.security.auth.UnixPrincipal.init(UnixPrincipal.java:70)
at 
com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

..

---

This does not always happen but I see a pattern when the intermediate data
is larger, it tends to occur more frequently. In the web log, I can see the
following:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

From what I read online, a possible cause is when there is not enough
memory for all the JVMs. My mapred-site.xml is set up to allocate 1100 MB for
each child, and the maximum numbers of map and reduce tasks are set to 3 each
- so that is 6600 MB for the child JVMs + (500 MB * 2) for the data node and
task tracker (as I set HADOOP_HEAP to 500 MB). I feel like memory is not the
cause, but I couldn't avoid the error so far.
In case it helps, here are the relevant sections of my mapred-site.xml

---

<name>mapred.tasktracker.map.tasks.maximum</name>
<value>3</value>

<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>

<name>mapred.child.java.opts</name>
<value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/tmp/soner</value>

<name>mapred.reduce.parallel.copies</name>
<value>5</value>

<name>tasktracker.http.threads</name>
<value>80</value>
---

My jobs still complete most of the time though they occasionally fail and
I'm really puzzled at this point. I'd appreciate any help or ideas.

Thanks


Re: pauses during startup (maybe network related?)

2013-05-23 Thread Ted
thanks, I'm almost 100% sure it's network related now.

What I tested was unplugging my network :) and the entire system starts in
just a few seconds.

I decided to search on reverse dns in google and I see other people
have complained about very slow reverse dns lookups (some related to
hadoop / hbase too).

I'm not sure why this is happening yet though. I thought 127.0.0.1 or
localhost would have just resolved instantly - but it appears it's
somehow finding my real IP instead, i.e. 192.168.1.5 seems to show up
in the log entries even though all my configurations say
localhost/127.0.0.1 and my /etc/hosts file has an entry for
localhost/127.0.0.1.

I think if I make an /etc/hosts entry for 192.168.1.5 everything will
be quick; that's what I'm going to test later. The only problem is I'm
on a dynamic IP... I've considered just making entries for all
reasonable permutations like 192.168.1.1 through 192.168.1.20... but
I'm still mostly just miffed at how it knows I'm a 192 address when
I told it to use localhost.
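
(For what it's worth, the kind of /etc/hosts entry being described would look
something like the sketch below; the hostname is a made-up placeholder:)

# /etc/hosts (sketch)
127.0.0.1     localhost localhost.localdomain
192.168.1.5   mylaptop.local mylaptop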

On 5/24/13, Chris Nauroth cnaur...@hortonworks.com wrote:
 Hi Ted,

 2013-05-23 19:28:19,937 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
 occuring more than 10 times
 ...
 2013-05-23 19:28:26,801 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 28 on 9000: starting

 There are a couple of relevant activities that happen during namenode
 startup in between these 2 log statements.  It loads the current fsimage
 (persistent copy of file system metadata), merges in the edits log
 (transaction log containing all file system metadata changes since the last
 checkpoint), and then saves back a new fsimage file after that merge.
  Current versions of the Hadoop codebase will print some information to
 logs about the volume of activity during this checkpointing process, so I
 recommend looking for that in your logs to see if this explains it.
  Depending on whether or not your have a large number of transactions
 queued since your last checkpoint, this whole process can cause namenode
 startup to take several minutes.

 If this becomes a regular problem, then you can run SecondaryNameNode or
 BackupNode to perform periodic checkpoints in addition to the checkpoint
 that occurs on namenode restart.  This is probably overkill for a dev
 environment on your laptop though.

 Hope this helps,

 Chris Nauroth
 Hortonworks
 http://hortonworks.com/



 On Thu, May 23, 2013 at 2:49 AM, Ted r6squee...@gmail.com wrote:

 Hi I'm running hadoop on my local laptop for development and
 everything works but there's some annoying pauses during the startup
 which causes the entire hadoop startup process to take up to 4 minutes
 and I'm wondering what it is and if I can do anything about it.

 I'm running everything on 1 machines, on fedora linux, hadoop-1.1.2,
 oracle jkd1.7.0_17, the machine is a dual core i5, and I have 8gb of
 ram and an SSD so it shouldn't be slow.

 When the system pauses, there is no cpu usage, no disk usage and no
 network usage (although I suspect it's waiting for the network to
 resolve or return something).

 Here's some snippets from the namenode logs during startup where you
 can see it just pauses for around 30 seconds or more with out errors
 or anything :

 ...
 2013-05-23 19:26:37,660 INFO
 org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
 hadoop-metrics2.properties
 2013-05-23 19:26:37,676 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
 MetricsSystem,sub=Stats registered.
 2013-05-23 19:27:54,144 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
 period at 10 second(s).
 2013-05-23 19:27:54,144 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
 system started
 ...
 2013-05-23 19:27:54,341 WARN
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The
 dfs.support.append option is in your configuration, however append is
 not supported. This configuration option is no longer required to
 enable sync.
 2013-05-23 19:27:54,341 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
 accessTokenLifetime=0 min(s)
 2013-05-23 19:28:19,918 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStateMBean and NameNodeMXBean
 2013-05-23 19:28:19,937 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
 occuring more than 10 times
 ...
 2013-05-23 19:28:26,801 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 28 on 9000: starting
 2013-05-23 19:28:26,833 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 31 on 9000: starting
 2013-05-23 19:30:10,644 INFO org.apache.hadoop.hdfs.StateChange:
 BLOCK* NameSystem.registerDatanode: node registration from
 127.0.0.1:50010 storage DS-651015167-192.168.1.5-50010-1369140176513
 2013-05-23 19:30:10,650 INFO org.apache.hadoop.net.NetworkTopology:
 Adding a new node: /default-rack/127.0.0.1:50010


 I already 

Where to begin from??

2013-05-23 Thread Lokesh Basu
Hi all,

I'm a computer science undergraduate and have recently started to explore
Hadoop. I find it very interesting and want to get involved both as a
contributor and a developer for this open source project. I have been going
through many textbooks related to Hadoop and HDFS, but I still find it very
difficult to decide where a beginner should start before writing his first
line of code as a contributor or developer.

Also please tell me what things I compulsorily need to know before
I dive into the depths of these topics.

Thanking you all in anticipation.




-- 

*Lokesh Chandra Basu*
B. Tech
Computer Science and Engineering
Indian Institute of Technology, Roorkee
India(GMT +5hr 30min)
+91-8267805498


Re: splittable vs seekable compressed formats

2013-05-23 Thread Rahul Bhattacharjee
I think seekability is a property of the file system, so any file stored in HDFS
is seekable: the FSDataInputStream returned by FileSystem.open() implements
Seekable, while the output stream does not.

Thanks,
Rahul
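
To make that concrete, here is a minimal sketch of seeking to an absolute
offset in an HDFS file (the path /user/test/data.txt is just a placeholder).
Note that this seeks within the stored bytes; whether you can seek to an
arbitrary point in the uncompressed view of a compressed file still depends
on the codec or container format.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up the configured default fs
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/user/test/data.txt"));
    try {
      in.seek(1024L);                                  // absolute byte offset
      long start = in.getPos();                        // 1024 after the seek
      byte[] buf = new byte[128];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes starting at offset " + start);
    } finally {
      in.close();
    }
  }
}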


On Thu, May 23, 2013 at 11:01 PM, John Lilley john.lil...@redpoint.net wrote:

  I’ve read about splittable compressed formats in Hadoop.  Are any of
 these formats also “seekable” (in other words, is it possible to seek to an
 absolute location in the uncompressed data)?

 John




Re: Where to begin from??

2013-05-23 Thread Chris Embree
I'll be chastised and have mean things said about me for this.

Get some experience in IT before you start looking at Hadoop.  My reasoning
is this:  If you don't know how to develop real applications in a
Non-Hadoop world, you'll struggle a lot to develop with Hadoop.

Asking what things you compulsorily need to know is like saying you want
to learn computers -- totally worthless!  Find a problem to solve and
seek to learn the tools you need to solve your problem.  Otherwise, your
learning is unapplied and somewhat useless.

Picture asking a recent acting school graduate to direct the next Star Wars
movie.  It's almost like that.


On Thu, May 23, 2013 at 10:39 PM, Lokesh Basu lokesh.b...@gmail.com wrote:

 Hi all,

 I'm a computer science undergraduate and has recently started to explore
 about Hadoop. I find it very interesting and want to get involved both as
 contributor and developer for this open source project. I have been going
 through many text book related to Hadoop and HDFS but still I find it very
 difficult as to where should a beginner start from before writing his first
 line of code as contributer or developer.

 Also please tell me what are the things I compulsorily need to know before
 I dive into depth of these things.

 Thanking you all in anticipation.




 --

 *Lokesh Chandra Basu*
 B. Tech
 Computer Science and Engineering
 Indian Institute of Technology, Roorkee
 India(GMT +5hr 30min)
 +91-8267805498





Re: Where to begin from??

2013-05-23 Thread Sanjay Subramanian
I agree with Chris…don't worry about what the technology is called (Hadoop, Big 
table, Lucene, Hive)….Model the problem and see what the solution could 
be….that’s very important

And Lokesh please don't mind…we are writing to u perhaps stuff that u don't 
want to hear but it's an important real perspective

To illustrate what I mean let me give u a few problems to think about and see 
how u would solve them….

1. Before Microsoft took over Skype at least this feature used to be there and 
the feature is like this……u type the name of a person and it used to come back 
with some search results in milliseconds often searching close to a billion 
names…….How would u design such a search architecture ?

2.  In 2012, say 50 million users (cookie based) searched Macys.com on a SALES 
weekend and say 20,000 bought $100 shoes. Now this year 2013 on that 
SALES weekend 60 million users (cookie based) are buying on the website….You 
want to give a 25% extra reward to only those cookies that were from last 
year…So u are looking for an intersection set of possibly 20,000 cookies in two 
sets - 50 million and 60 million…..How would u solve this problem within 
milliseconds?

3. Last my favorite….The Postal Services department wants to think of new 
business ideas to avoid bankruptcy…One idea I have is they have zillion small 
delivery vans that go to each street in the country….Say I lease out the space 
to BIG wireless phone providers and promise them that I will mount 
wireless signal strength measurement systems on these vans and I will provide 
them data 3 times a day…how will u devise a solution to analyse and store 
data?

I am sure if u look around in India as well u will see a lot of situations 
where u want to solve a problem….

As Chris says , think about the problem u want to solve, then model the 
solutions and pick the best one…

On the flip side….I can tell u it will still be a few years till many Banks and 
Stock trading houses will believe in Cassandra and Hbase for OLTP because that 
data is critical……If your timeline in Facebook does not show a photo, it's 
possibly OK but if your 1 million deposit in a bank does not show up for days or 
suddenly vanishes - u r possibly not going to take that lightly…..

Ok enough RAMBLING….

Good luck

sanjay



From: Chris Embree cemb...@gmail.com
Reply-To: user@hadoop.apache.org, ch...@embree.us
Date: Thursday, May 23, 2013 7:47 PM
To: user@hadoop.apache.org
Subject: Re: Where to begin from??

I'll be chastised and have mean things said about me for this.

Get some experience in IT before you start looking at Hadoop.  My reasoning is 
this:  If you don't know how to develop real applications in a Non-Hadoop 
world, you'll struggle a lot to develop with Hadoop.

Asking what things you need to know in compulsory is like saying you want to 
learn computers -- totally worthless!  Find a problem to solve and seek to 
learn the tools you need to solve your problem.  Otherwise, your learning is 
un-applied and somewhat useless.

Picture a recent acting school graduate how to direct the next Star Wars movie. 
 It's almost like that.


On Thu, May 23, 2013 at 10:39 PM, Lokesh Basu lokesh.b...@gmail.com wrote:
Hi all,

I'm a computer science undergraduate and has recently started to explore about 
Hadoop. I find it very interesting and want to get involved both as contributor 
and developer for this open source project. I have been going through many text 
book related to Hadoop and HDFS but still I find it very difficult as to where 
should a beginner start from before writing his first line of code as 
contributer or developer.

Also please tell me what are the things I compulsorily need to know before I 
dive into depth of these things.

Thanking you all in anticipation.




--

Lokesh Chandra Basu
B. Tech
Computer Science and Engineering
Indian Institute of Technology, Roorkee
India(GMT +5hr 30min)
+91-8267805498






Task attempt failed after TaskAttemptListenerImpl ping

2013-05-23 Thread YouPeng Yang
Hi hadoop users

 I find that One application filed when the  container log it shows that it
always ping [2].

How does it come out?

I'm using the YARN and MRv2(CDH-4.1.2)



[1]resourcemanager.log
2013-05-24 09:45:07,192 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
launching container Container: [ContainerId:
container_1369298403742_0144_01_01, NodeId: wxossetl3:29984,
NodeHttpAddress: wxossetl3:8042, Resource: memory: 1536, Priority:
org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@1f, State: NEW,
Token: null, Status: container_id {, app_attempt_id {, application_id {,
id: 144, cluster_timestamp: 1369298403742, }, attemptId: 1, }, id: 1, },
state: C_NEW, ] for AM appattempt_1369298403742_0144_01
2013-05-24 09:45:07,192 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1369298403742_0144_01 State change from ALLOCATED to LAUNCHED
2013-05-24 09:45:08,186 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1369298403742_0144_01_01 Container Transitioned from ACQUIRED
to RUNNING
2013-05-24 09:45:10,533 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM
registration appattempt_1369298403742_0144_01
2013-05-24 09:45:10,533 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=hadoop IP=172.16.250.1OPERATION=Register
App Master TARGET=ApplicationMasterService RESULT=SUCCESS
APPID=application_1369298403742_0144
  APPATTEMPTID=appattempt_1369298403742_0144_01
2013-05-24 09:45:10,533 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1369298403742_0144_01 State change from LAUNCHED to RUNNING
2013-05-24 09:45:10,533 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1369298403742_0144 State change from ACCEPTED to RUNNING

[2] container syslog:


2013-05-24 10:00:10,222 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event
EventType: CONTAINER_REMOTE_LAUNCH for container
container_1369298403742_0153_01_01 taskAttempt
attempt_1369298403742_0153_m_00_0
2013-05-24 10:00:10,223 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt:
[attempt_1369298403742_0153_m_00_0] using containerId:
[container_1369298403742_0153_01_01 on NM: [wxossetl1:46256]
2013-05-24 10:00:10,223 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.LocalContainerLauncher:
mapreduce.cluster.local.dir for uber task:
/tmp/nm-local-dir/usercache/hadoop/appcache/application_1369298403742_0153
2013-05-24 10:00:10,225 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1369298403742_0153_m_00_0 TaskAttempt Transitioned from
ASSIGNED to RUNNING
2013-05-24 10:00:10,226 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1369298403742_0153_m_00 Task Transitioned from SCHEDULED to RUNNING
2013-05-24 10:00:10,237 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin :
org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@34e77781
2013-05-24 10:00:13,224 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl:
Ping from attempt_1369298403742_0153_m_00_0
2013-05-24 10:00:16,225 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1369298403742_0153_m_00_0
2013-05-24 10:00:19,225 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
attempt_1369298403742_0153_m_00_0

..


Re: Where to begin from??

2013-05-23 Thread Raj Hadoop


Hi,

With all due respect to the senior members of this site, I wanted to first 
congratulate Lokesh for his interest in Hadoop. I wonder how many fresh 
graduates are interested in this technology. I guess not many. So we have to 
welcome Lokesh to the Hadoop world.

I agree with the seniors...It is good and important to know the real-world 
problems.

But coming to your question - as per my knowledge - if u want to learn / shine 
in Hadoop - know the following compulsorily.
1) Linux
2) Java
3) Sql


Seniors may correct me, or add to or modify the above list.


Thanks,
Raj



 From: Sanjay Subramanian sanjay.subraman...@wizecommerce.com
To: user@hadoop.apache.org; ch...@embree.us 
Sent: Thursday, May 23, 2013 11:03 PM
Subject: Re: Where to begin from??
 


I agree with Chris…don't worry about what the technology is called Hadoop , Big 
table, Lucene, Hive….Model the problem and see what the solution could 
be….that’s very important 

And Lokesh please don't mind…we are writing to u perhaps stuff that u don't 
want to hear but its an important real perspective

To illustrate what I mean let me give u a few problems to think about and see 
how u would solve them….

1. Before Microsoft took over Skype at least this feature used to be there and 
the feature is like this……u type the name of a person and it used to come back 
with some search results in milliseconds often searching close to a billion 
names…….How would u design such a search architecture ?

2.  In 2012, say 50 million users (cookie based) searched Macys.com on a SALES 
weekend and say 20,000 bought $100 dollar shoes. Now this year 2013 on that 
SALES weekend 60 million users (cookie based) are buying on the website….You 
want to give a 25% extra reward to only those cookies that were from last 
year…So u are looking for an intersection set of possibly 20,000 cookies in two 
sets - 50million and 60 million…..How would u solve this problem within milli 
seconds  ?

3. Last my favorite….The Postal Services department wants to think of new 
business ideas to avoid bankruptcy…One idea I have is they have zillion small 
delivery vans that go to each street in the country….Say I lease out the space 
to BIG wireless phone providers and promise them them that I will mount 
wireless signal strength measurement systems on these vans and I will provide 
them data 3  times a day…how will u devise a solution to analyse and store data 
?

I am sure if u look around in India as well u will see a lot of situations 
where u want to solve a problem….

As Chris says , think about the problem u want to solve, then model the 
solutions and pick the best one…

On the flip side….I can tell u it will still be a few years till many Banks and 
Stock trading houses will believe in Cassandra and Hbase for OLTP because that 
data is critical……If your timeline in Facebook does not show a photo , its 
possibly OK but if your 1 million deposit I a bank does not show up for days or 
suddenly vanishes - u r possibly not going to take that lightly…..

Ok enough RAMBLING….

Good luck

sanjay
  

From: Chris Embree cemb...@gmail.com
Reply-To: user@hadoop.apache.org, ch...@embree.us
Date: Thursday, May 23, 2013 7:47 PM
To: user@hadoop.apache.org
Subject: Re: Where to begin from??


I'll be chastised and have mean things said about me for this. 

Get some experience in IT before you start looking at Hadoop.  My reasoning is 
this:  If you don't know how to develop real applications in a Non-Hadoop 
world, you'll struggle a lot to develop with Hadoop.

Asking what things you need to know in compulsory is like saying you want to 
learn computers -- totally worthless!  Find a problem to solve and seek to 
learn the tools you need to solve your problem.  Otherwise, your learning is 
un-applied and somewhat useless. 

Picture a recent acting school graduate how to direct the next Star Wars movie. 
 It's almost like that.



On Thu, May 23, 2013 at 10:39 PM, Lokesh Basu lokesh.b...@gmail.com wrote:

Hi all, 


I'm a computer science undergraduate and has recently started to explore about 
Hadoop. I find it very interesting and want to get involved both as 
contributor and developer for this open source project. I have been going 
through many text book related to Hadoop and HDFS but still I find it very 
difficult as to where should a beginner start from before writing his first 
line of code as contributer or developer.


Also please tell me what are the things I compulsorily need to know before I 
dive into depth of these things.  


Thanking you all in anticipation. 




-- 


Lokesh Chandra Basu

B. Tech
Computer Science and Engineering

Indian Institute of Technology, Roorkee
India(GMT +5hr 30min)
+91-8267805498





Re: Hadoop Classpath issue.

2013-05-23 Thread YouPeng Yang
Hi
   You should check your /usr/bin/hadoop script.
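
A quick way to locate where the stray output is produced (a sketch; the paths
below assume a typical CDH package layout and may differ on your machine):

# trace the wrapper script and watch for the point where "common ./" is printed
bash -x /usr/bin/hadoop classpath 2>&1 | less

# look for stray echo statements in the script and in the env file it sources
grep -n "echo" /usr/bin/hadoop /etc/hadoop/conf/hadoop-env.sh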



2013/5/23 Dhanasekaran Anbalagan bugcy...@gmail.com

 Hi Guys,

 When I try to execute the hadoop fs -ls / command,
 it returns two extra lines.

 226:~# hadoop fs -ls /
 *common ./*
 *lib lib*
 Found 9 items
 drwxrwxrwx   - hdfs   supergroup  0 2013-03-07 04:46 /benchmarks
 drwxr-xr-x   - hbase  hbase   0 2013-05-23 08:59 /hbase
 drwxr-xr-x   - hdfs   supergroup  0 2013-02-20 13:21 /mapred
 drwxr-xr-x   - tech   supergroup  0 2013-05-03 05:15 /test
 drwxrwxrwx   - mapred supergroup  0 2013-05-23 09:33 /tmp
 drwxrwxr-x   - hdfs   supergroup  0 2013-02-20 16:32 /user
 drwxr-xr-x   - hdfs   supergroup  0 2013-02-20 15:10 /var


 On other machines these two extra lines are not printed. Please guide me on
 how to remove them.

 226:~# /usr/bin/hadoop classpath
 common ./
 lib lib

 /etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*


 Please guide me How to fix this.

 -Dhanasekaran
 Did I learn something today? If not, I wasted it.



Re: Where to begin from??

2013-05-23 Thread Lokesh Basu
First of all thank you all.

I accept that I don't know much about real-world problems and have to
begin from scratch to get some insight into what is actually driving these
technologies.

to Chris :

I will start working on finding and implementing some real world problem
and see how these things are implemented in the first place before I try to
do something out of the box.

to Sanjay :

Thank you very much for the sample problems to look into before going into
much detail about it.

to Raj :

Thank you for the appreciation and support for my attempt to learn and
implement something which is new to me. The things that you mentioned, like
Linux, Java and SQL, are very familiar to me, and in fact I have some
implementation experience with SQL, PHP, Python and C++. I have built some
online event websites and a command-based search engine for small-scale
search (without anything as complex as PageRank). I also have some
experience with version control systems, as I was trying to qualify for GSoC
2012 (AbiWord, but was unsuccessful).

Right now I just need something like a guide that can help me move forward
from the start and learn as much as I can, because I'm willing to give all
the time I have to learn more and more about these things.

Thanking you all for your kind replies and support.


*Lokesh Chandra Basu*
B. Tech
Computer Science and Engineering
Indian Institute of Technology, Roorkee
India(GMT +5hr 30min)
+91-8267805498



On Fri, May 24, 2013 at 9:35 AM, Raj Hadoop hadoop...@yahoo.com wrote:


 Hi,

 With all due to respect to the senior members of this site, I wanted to
 first congratulate Lokesh for his interest in Hadoop. I want to know how
 many fresh graduates are interested in this technology. I guess not many.
 So we have to welcome Lokesh to Hadoop world.

 I agree to the seniors...It is good and important to know the real
 world problems 

 But coming to your question - as per my knowledge - if u want to learn /
 shine in Hadoop - know the following compulsorily.
 1) Linux
 2) Java
 3) Sql

 Seniors may correct me or add or modify to the following list.

 Thanks,
 Raj
  --
  *From:* Sanjay Subramanian sanjay.subraman...@wizecommerce.com
 *To:* user@hadoop.apache.org; ch...@embree.us
 *Sent:* Thursday, May 23, 2013 11:03 PM

 *Subject:* Re: Where to begin from??

  I agree with Chris…don't worry about what the technology is called
 Hadoop , Big table, Lucene, Hive….Model the problem and see what the
 solution could be….that’s very important

  And Lokesh please don't mind…we are writing to u perhaps stuff that u
 don't want to hear but its an important real perspective

  To illustrate what I mean let me give u a few problems to think about
 and see how u would solve them….

  1. Before Microsoft took over Skype at least this feature used to be
 there and the feature is like this……u type the name of a person and it used
 to come back with some search results in milliseconds often searching close
 to a billion names…….How would u design such a search architecture ?

  2.  In 2012, say 50 million users (cookie based) searched Macys.com on a
 SALES weekend and say 20,000 bought $100 dollar shoes. Now this year 2013
 on that SALES weekend 60 million users (cookie based) are buying on the
 website….You want to give a 25% extra reward to only those cookies that
 were from last year…So u are looking for an intersection set of possibly
 20,000 cookies in two sets - 50million and 60 million…..How would u solve
 this problem within milli seconds  ?

  3. Last my favorite….The Postal Services department wants to think of
 new business ideas to avoid bankruptcy…One idea I have is they have zillion
 small delivery vans that go to each street in the country….Say I lease out
 the space to BIG wireless phone providers and promise them them that I will
 mount wireless signal strength measurement systems on these vans and I will
 provide them data 3  times a day…how will u devise a solution to analyse
 and store data ?

  I am sure if u look around in India as well u will see a lot of
 situations where u want to solve a problem….

  As Chris says , think about the problem u want to solve, then model the
 solutions and pick the best one…

  On the flip side….I can tell u it will still be a few years till many
 Banks and Stock trading houses will believe in Cassandra and Hbase for OLTP
 because that data is critical……If your timeline in Facebook does not show a
 photo , its possibly OK but if your 1 million deposit I a bank does not
 show up for days or suddenly vanishes - u r possibly not going to take that
 lightly…..

  Ok enough RAMBLING….

  Good luck

  sanjay



   From: Chris Embree cemb...@gmail.com
 Reply-To: user@hadoop.apache.org, ch...@embree.us
 Date: Thursday, May 23, 2013 7:47 PM
 To: user@hadoop.apache.org
 Subject: Re: Where to begin from??

   

Re: Hadoop Classpath issue.

2013-05-23 Thread shashwat shriparv
Check your HDFS at namenode:50070 to see if these files are there...

*Thanks & Regards*

∞
Shashwat Shriparv



On Fri, May 24, 2013 at 9:45 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:

 Hi
You should check your /usr/bin/hadoop script.



 2013/5/23 Dhanasekaran Anbalagan bugcy...@gmail.com

 Hi Guys,

 When i trying to execute hadoop fs -ls / command
 It's return extra two lines.

 226:~# hadoop fs -ls /
 *common ./*
 *lib lib*
 Found 9 items
 drwxrwxrwx   - hdfs   supergroup  0 2013-03-07 04:46 /benchmarks
 drwxr-xr-x   - hbase  hbase   0 2013-05-23 08:59 /hbase
 drwxr-xr-x   - hdfs   supergroup  0 2013-02-20 13:21 /mapred
 drwxr-xr-x   - tech   supergroup  0 2013-05-03 05:15 /test
 drwxrwxrwx   - mapred supergroup  0 2013-05-23 09:33 /tmp
 drwxrwxr-x   - hdfs   supergroup  0 2013-02-20 16:32 /user
 drwxr-xr-x   - hdfs   supergroup  0 2013-02-20 15:10 /var


 In other machines. Not return extra to lines. Please guide me how to
 remove this line.

 226:~# /usr/bin/hadoop classpath
 common ./
 lib lib

 /etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*


 Please guide me How to fix this.

 -Dhanasekaran
 Did I learn something today? If not, I wasted it.





Re: Task attempt failed after TaskAttemptListenerImpl ping

2013-05-23 Thread Harsh J
Assuming you mean failed there instead of filed.

In MR, a ping message is sent over the TaskUmbilicalProtocol from the
Task container to the MR AM. A ping is only sent as an alternative, to
check self, if there's no progress to report from the task. No
progress to report for a long time generally means the task has
stopped doing work/isn't updating its status/is stuck.
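
If the task is legitimately doing long stretches of work without emitting
records, the usual remedy (a sketch, not taken from this thread) is to report
progress explicitly so the AM keeps seeing status updates rather than bare
pings; otherwise the attempt can eventually be killed once
mapreduce.task.timeout expires:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SlowWorkMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (int i = 0; i < 100; i++) {
      doExpensiveStep(value, i);        // placeholder for the real long-running work
      context.progress();               // tell the framework this attempt is still alive
      context.setStatus("step " + i);   // optional human-readable status
    }
  }

  private void doExpensiveStep(Text value, int step) {
    // hypothetical heavy computation goes here
  }
}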

On Fri, May 24, 2013 at 8:46 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:
 Hi hadoop users

  I find that One application filed when the  container log it shows that it
 always ping [2].

 How does it come out?

 I'm using the YARN and MRv2(CDH-4.1.2)



 [1]resourcemanager.log
 2013-05-24 09:45:07,192 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
 launching container Container: [ContainerId:
 container_1369298403742_0144_01_01, NodeId: wxossetl3:29984,
 NodeHttpAddress: wxossetl3:8042, Resource: memory: 1536, Priority:
 org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@1f, State: NEW,
 Token: null, Status: container_id {, app_attempt_id {, application_id {, id:
 144, cluster_timestamp: 1369298403742, }, attemptId: 1, }, id: 1, }, state:
 C_NEW, ] for AM appattempt_1369298403742_0144_01
 2013-05-24 09:45:07,192 INFO
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
 appattempt_1369298403742_0144_01 State change from ALLOCATED to LAUNCHED
 2013-05-24 09:45:08,186 INFO
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
 container_1369298403742_0144_01_01 Container Transitioned from ACQUIRED
 to RUNNING
 2013-05-24 09:45:10,533 INFO
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM
 registration appattempt_1369298403742_0144_01
 2013-05-24 09:45:10,533 INFO
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop
 IP=172.16.250.1OPERATION=Register App Master TARGET=ApplicationMasterService
 RESULT=SUCCESS APPID=application_1369298403742_0144
 APPATTEMPTID=appattempt_1369298403742_0144_01
 2013-05-24 09:45:10,533 INFO
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
 appattempt_1369298403742_0144_01 State change from LAUNCHED to RUNNING
 2013-05-24 09:45:10,533 INFO
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
 application_1369298403742_0144 State change from ACCEPTED to RUNNING

 [2] container syslog:


 2013-05-24 10:00:10,222 INFO [uber-SubtaskRunner]
 org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event
 EventType: CONTAINER_REMOTE_LAUNCH for container
 container_1369298403742_0153_01_01 taskAttempt
 attempt_1369298403742_0153_m_00_0
 2013-05-24 10:00:10,223 INFO [AsyncDispatcher event handler]
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt:
 [attempt_1369298403742_0153_m_00_0] using containerId:
 [container_1369298403742_0153_01_01 on NM: [wxossetl1:46256]
 2013-05-24 10:00:10,223 INFO [uber-SubtaskRunner]
 org.apache.hadoop.mapred.LocalContainerLauncher: mapreduce.cluster.local.dir
 for uber task:
 /tmp/nm-local-dir/usercache/hadoop/appcache/application_1369298403742_0153
 2013-05-24 10:00:10,225 INFO [AsyncDispatcher event handler]
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
 attempt_1369298403742_0153_m_00_0 TaskAttempt Transitioned from ASSIGNED
 to RUNNING
 2013-05-24 10:00:10,226 INFO [AsyncDispatcher event handler]
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
 task_1369298403742_0153_m_00 Task Transitioned from SCHEDULED to RUNNING
 2013-05-24 10:00:10,237 INFO [uber-SubtaskRunner]
 org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin :
 org.apache.hadoop.yarn.util.LinuxResourceCalculatorPlugin@34e77781
 2013-05-24 10:00:13,224 INFO [communication thread]
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
 attempt_1369298403742_0153_m_00_0
 2013-05-24 10:00:16,225 INFO [communication thread]
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
 attempt_1369298403742_0153_m_00_0
 2013-05-24 10:00:19,225 INFO [communication thread]
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
 attempt_1369298403742_0153_m_00_0

 ..




-- 
Harsh J


Re: pauses during startup (maybe network related?)

2013-05-23 Thread Harsh J
You are spot on about the DNS lookup slowing things down. I've faced
the same issue (before I had a local network DNS set up for the WiFi
network I use).

 but I'm still more just miffed at how it's knowing I'm a 192 address when I 
 told it to use localhost.

There are a few configs you need to additionally change to make a
perfect localhost setup. Otherwise, there are defaults in Apache
Hadoop that bind to 0.0.0.0 and report the current system hostname
(which changes if you get onto a network), causing what you're seeing.
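
For reference, one possible set of overrides for a laptop-only 1.x setup (a
sketch; double-check the property names against your version's
hdfs-default.xml before relying on them):

<!-- hdfs-site.xml: pin the datanode to the loopback interface so it never
     registers with the LAN hostname/address -->
<property>
  <name>dfs.datanode.address</name>
  <value>127.0.0.1:50010</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>127.0.0.1:50075</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>127.0.0.1:50020</value>
</property>
<property>
  <name>slave.host.name</name>
  <value>localhost</value>
</property>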

On Fri, May 24, 2013 at 7:42 AM, Ted r6squee...@gmail.com wrote:
 thanks, I'm almost 100% sure it's network related now.

 What I tested was unplugging my network :), the entire system starts in
 just a few seconds.

 I decided to search on reverse dns in google and I see other people
 have complained about very slow reverse dns lookups (some related to
 hadoop / hbase too).

 I'm not sure why this is happening yet though. I thought 127.0.0.1 or
 localhost would have just resolved instantly - but it appears it's
 somehow finding my real IP instead, i.e. 192.168.1.5 seems to show up
 in the log entries even though all my configurations say
 localhost/127.0.0.1 and my /etc/hosts file has an entry for
 localhost/127.0.0.1

 I think if I make a /etc/hosts entry for 192.168.1.5 everything will
 be quick, that's what I'm going to test later. The only problem is I'm
 on an dynamic IP... I've considered just making entries for all
 reasonable permutations like 192.168.1.1 through 192.168.1.20... but
 I'm still more just miffed at how it's knowing I'm a 192 address when
 I told it to use localhost.

 On 5/24/13, Chris Nauroth cnaur...@hortonworks.com wrote:
 Hi Ted,

 2013-05-23 19:28:19,937 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
 occuring more than 10 times
 ...
 2013-05-23 19:28:26,801 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 28 on 9000: starting

 There are a couple of relevant activities that happen during namenode
 startup in between these 2 log statements.  It loads the current fsimage
 (persistent copy of file system metadata), merges in the edits log
 (transaction log containing all file system metadata changes since the last
 checkpoint), and then saves back a new fsimage file after that merge.
  Current versions of the Hadoop codebase will print some information to
 logs about the volume of activity during this checkpointing process, so I
 recommend looking for that in your logs to see if this explains it.
  Depending on whether or not your have a large number of transactions
 queued since your last checkpoint, this whole process can cause namenode
 startup to take several minutes.

 If this becomes a regular problem, then you can run SecondaryNameNode or
 BackupNode to perform periodic checkpoints in addition to the checkpoint
 that occurs on namenode restart.  This is probably overkill for a dev
 environment on your laptop though.

 Hope this helps,

 Chris Nauroth
 Hortonworks
 http://hortonworks.com/



 On Thu, May 23, 2013 at 2:49 AM, Ted r6squee...@gmail.com wrote:

 Hi I'm running hadoop on my local laptop for development and
 everything works but there's some annoying pauses during the startup
 which causes the entire hadoop startup process to take up to 4 minutes
 and I'm wondering what it is and if I can do anything about it.

 I'm running everything on 1 machines, on fedora linux, hadoop-1.1.2,
 oracle jkd1.7.0_17, the machine is a dual core i5, and I have 8gb of
 ram and an SSD so it shouldn't be slow.

 When the system pauses, there is no cpu usage, no disk usage and no
 network usage (although I suspect it's waiting for the network to
 resolve or return something).

 Here's some snippets from the namenode logs during startup where you
 can see it just pauses for around 30 seconds or more with out errors
 or anything :

 ...
 2013-05-23 19:26:37,660 INFO
 org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
 hadoop-metrics2.properties
 2013-05-23 19:26:37,676 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
 MetricsSystem,sub=Stats registered.
 2013-05-23 19:27:54,144 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
 period at 10 second(s).
 2013-05-23 19:27:54,144 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
 system started
 ...
 2013-05-23 19:27:54,341 WARN
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The
 dfs.support.append option is in your configuration, however append is
 not supported. This configuration option is no longer required to
 enable sync.
 2013-05-23 19:27:54,341 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
 accessTokenLifetime=0 min(s)
 2013-05-23 19:28:19,918 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStateMBean and NameNodeMXBean
 2013-05-23 19:28:19,937 INFO
 

Hadoop 2.0.4: Unable to load native-hadoop library for your platform

2013-05-23 Thread Ben Kim
Hi, I downloaded Hadoop 2.0.4 and keep getting this warning from the hadoop CLI
and from MapReduce task logs:

13/05/24 14:34:17 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

I tried adding $HADOOP_HOME/lib/native/* to CLASSPATH and LD_LIBRARY_PATH,
but neither worked.

Has anyone had a similar problem?

TY!

-- 

*Benjamin Kim*
*benkimkimben at gmail*
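
A commonly suggested starting point for this warning (not from this thread;
the paths assume a plain tarball install, and note that the stock 2.0.x
tarball ships 32-bit native libraries, so a 64-bit OS may need a rebuild):

# check whether the shipped native library matches your architecture
file $HADOOP_HOME/lib/native/libhadoop.so.1.0.0

# point the JVM at the native directory (hadoop-env.sh or your shell profile)
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

# if the shipped libraries do not match your platform, rebuild them from source
mvn package -Pdist,native -DskipTests -Dtar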


Hint on EOFException's on datanodes

2013-05-23 Thread Stephen Boesch
On a smallish (10-node) cluster with only 2 mappers per node, EOFExceptions
are cropping up on the datanodes after a few minutes; an example is shown
below.

Any hints on what to tweak/change in Hadoop / cluster settings to make this
happier?
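
Two knobs that are commonly checked for this symptom (a sketch;
dfs.datanode.max.xcievers is the 1.x-era spelling, renamed
dfs.datanode.max.transfer.threads in 2.x, so match it to your version):

<!-- hdfs-site.xml: raise the cap on concurrent block-transfer threads per datanode -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

# also confirm the datanode user can open enough files/sockets
ulimit -n    # check the current limit; raise it in /etc/security/limits.conf if it is low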


2013-05-24 05:03:57,460 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@1b1accfc): writeBlock
blk_7760450154173670997_48372 received exception java.io.EOFException:
while trying to read 65557 bytes
2013-05-24 05:03:57,262 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 0 for
Block blk_-3990749197748165818_48331): PacketResponder 0 for block
blk_-3990749197748165818_48331 terminating
2013-05-24 05:03:57,460 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@1b1accfc):
DatanodeRegistration(10.254.40.79:9200,
storageID=DS-1106090267-10.254.40.79-9200-1369343833886, infoPort=9102,
ipcPort=9201):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:268)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:406)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
at java.lang.Thread.run(Thread.java:662)
2013-05-24 05:03:57,261 INFO org.apache.hadoop.hdfs.server.datanode.Dat