How exactly does Oozie work internally?

2013-08-12 Thread Kasa V Varun Tej
Folks,

I have been working on the Oozie SSH action for the past 2 days and I'm unable
to implement anything using the SSH action. I'm facing some permission issues,
so I thought that if someone could provide me with some information on how it
actually works, it might help me debug the issues I'm facing.

The task I want to perform is to read a file on a particular node and push
those values to an email action.
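
Roughly what I have in mind (a hedged sketch; the path, key name, and action name
below are made up) is to have the SSH action run a small script on that node with
<capture-output/> enabled, print key=value lines to stdout, and then reference the
captured value from the email action via something like
${wf:actionData('ssh-read')['summary']}:

#!/usr/bin/env python
# read_values.py -- hypothetical script run on the remote node by the SSH action.
# Oozie captures stdout in key=value (Java properties) form when <capture-output/>
# is set, so a later email action can reference it via wf:actionData().
import sys

def main(path):
    with open(path) as f:
        content = f.read().strip()
    # Keep the value on one line so it stays a single key=value property.
    print "summary=%s" % content.replace("\n", " | ")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/tmp/values.txt")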

Thanks,
Kasa


Re: How to import custom Python module in MapReduce job?

2013-08-12 Thread Binglin Chang
Hi,

The problem seems to be caused by symlinks: Hadoop uses a file cache, so every
file is in fact a symlink.

lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
/root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
/root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
[root@master01 tmp]# ./main.py
Traceback (most recent call last):
  File "./main.py", line 3, in ?
import lib
ImportError: No module named lib

This should be a Python bug: when using import, it can't handle the symlink.

You can try putting lib.py inside a directory and using -cacheArchive, so that
the symlink actually links to a directory; Python may handle this case well.
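
Another workaround you could try while keeping -files (a hedged sketch, not
something I have verified on every version): put the task's current working
directory, where the symlinks are created, at the front of sys.path before
importing. Note that sys.path.append(os.path.realpath(__file__)) appends the file
path itself, and realpath also resolves the symlink into a filecache directory
that does not contain lib.py, which is probably why that attempt failed:

# main.py -- sketch of the sys.path workaround; the streaming task runs with its
# working directory set to the directory holding the lib.py/main.py symlinks.
import os
import sys

sys.path.insert(0, os.getcwd())  # make the task working directory importable
import lib

If that works, it would also confirm the problem is only in how the interpreter
derives the script directory, not in the file cache itself.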

Thanks,
Binglin



On Mon, Aug 12, 2013 at 2:50 PM, Andrei  wrote:

> (cross-posted from StackOverflow)
>
> I have a MapReduce job defined in file *main.py*, which imports module lib 
> from
> file *lib.py*. I use Hadoop Streaming to submit this job to Hadoop
> cluster as follows:
>
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>
> -files lib.py,main.py
> -mapper "./main.py map" -reducer "./main.py reduce"
> -input input -output output
>
> In my understanding, this should put both main.py and lib.py into the
> *distributed cache folder* on each computing machine and thus make module lib
> available to main. But it doesn't happen: from the log file I see that the files
> *are really copied* to the same directory, but main can't import lib, throwing an
> *ImportError*.
>
> Adding current directory to the path didn't work:
>
> import sys
> sys.path.append(os.path.realpath(__file__))import lib# ImportError
>
> though, loading module manually did the trick:
>
> import imp
> lib = imp.load_source('lib', 'lib.py')
>
> But that's not what I want. So why can the Python interpreter see other .py
> files in the same directory, but not import them? Note that I have already tried
> adding an empty __init__.py file to the same directory, without effect.
>
>
>


Re: DefaultResourceCalculator class not found, ResourceManager fails to start.

2013-08-12 Thread Rob Blah
Problem solved. Thank you for your help.

@Ted Yu
The other issues were my mistakes. I have a dedicated script which
updates/builds/"deploys" YARN from sources. I was starting the NN with the
"-upgrade" option, which desynchronized the NN version and also led to a broken
DN. A quick NN format and deletion of the DN data solved the issue (I am working
on a sandbox cluster, so that is not a problem). I have modified the script
to start the NN without the upgrade option.

Two quick questions:
- When should I use the NN upgrade option? Should it only be used to upgrade
the NN between versions (for example 2.0.4 -> 2.0.5)? How can I automate this
process?
- Is the "design/functionality" of my magical script correct? How can I avoid
future problems like the one just solved?

YARN update script
- update src to trunk (opt)
- package YARN
- build dist (tar ball)
- unpack new_dist
- overwrite new_dist conf with prev_dist conf (this is what led to the problem
with DefaultResourceCalculator; my conf is the bare minimum needed to work in
pseudo-distributed mode)
- start YARN

I would be grateful for any suggestions.

regards
tmp



2013/8/12 Ted Yu 

> Can you check the config entry
> for yarn.scheduler.capacity.resource-calculator ?
> It should point
> to org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
>
> bq. I was able to fix all issues
>
> What other issues came up ?
>
> Thanks
>
>
> On Sun, Aug 11, 2013 at 2:07 PM, Rob Blah  wrote:
>
>> Hi again
>>
>> From a little investigation I have performed I have observed the
>> following. I assume the module responsible for this class is
>> hadoop-yarn-common.
>>
>> During RM init it crashes since it is looking for a class
>> DefaultResourceCalculator in
>> org.apache.hadoop.yarn.server.resourcemanager.resource.DefaultResourceCalculator,
>> while the class is present in hadoop-yarn-common-3.0.0-SNAPSHOT.jar but
>> under org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator. Thus
>> the RM crashes. Does anybody know how I can fix this? I would be very
>> grateful for any help.
>>
>> regards
>> tmp
>>
>>
>> 2013/8/11 Rob Blah 
>>
>>> Hi
>>>
>>> I have a strange problem, regarding missing class, the
>>> DefaultResourceCalculator. I have a single node sandbox cluster working in
>>> a pseudo-distributed mode. The cluster was working fine yesterday, however
>>> today it stopped working. I was able to fix all issues except the following
>>> problem in ResourceManager:
>>> 2013-08-11 12:12:42,425 FATAL
>>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error
>>> starting ResourceManager
>>> java.lang.RuntimeException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class
>>> org.apache.hadoop.yarn.server.resourcemanager.resource.DefaultResourceCalculator
>>> not found
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1753)
>>> at
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getResourceCalculator(CapacitySchedulerConfiguration.java:333)
>>> at
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:258)
>>> at
>>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:241)
>>> at
>>> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>> at
>>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:826)
>>> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
>>> Class
>>> org.apache.hadoop.yarn.server.resourcemanager.resource.DefaultResourceCalculator
>>> not found
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1721)
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1745)
>>> ... 5 more
>>> Caused by: java.lang.ClassNotFoundException: Class
>>> org.apache.hadoop.yarn.server.resourcemanager.resource.DefaultResourceCalculator
>>> not found
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1625)
>>> at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1719)
>>> ... 6 more
>>> 2013-08-11 12:12:42,426 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
>>> SHUTDOWN_MSG:
>>>
>>> I build YARN from sources, daily updated to the newest revision in
>>> trunk. The class DefaultResourceCalculator exists and is present in YARN's
>>> sources. I am using (currently) trunk revision 1512895. I build YARN
>>> project with the following command:
>>> mvn clean package -Pdist -Dtar -DskipTests
>>> I create tar ball with the use of script provided in the sources:
>>> dist-tar-stitching.sh (hadoop-dist/target)
>>>
>>> regards
>>> tmp
>>>
>>
>>
>


fair scheduler :: reducer preemption

2013-08-12 Thread Ravi Shetye
Hi folks,
I have a Hadoop cluster running the FairScheduler (
hadoop.apache.org/docs/stable/fair_scheduler.html) with preemption set to
true.
The scheduler's preemption policy works well for mappers, but the reducers are
not getting preempted.

Any thoughts on this?
1) Is reducer preemption not supposed to work?
2) Is there another parameter I have to configure to enable reducer
preemption?
3) How do I go about debugging this issue?


-- 
RAVI SHETYE


Re: How exactly does Oozie work internally?

2013-08-12 Thread Wellington Chevreuil
Hi Kasa,

did you create the oozie user on the target SSH server, and does it have
all the user rights to execute what it should on the target server?

Regards,
Wellington.


2013/8/12 Kasa V Varun Tej 

> Folks,
>
> I have been working on the Oozie SSH action for the past 2 days and I'm unable
> to implement anything using the SSH action. I'm facing some permission issues,
> so I thought that if someone could provide me with some information on how it
> actually works, it might help me debug the issues I'm facing.
>
> The task I want to perform is to read a file on a particular node and push
> those values to an email action.
>
> Thanks,
> Kasa
>
>
>


Re: How exactly does Oozie work internally?

2013-08-12 Thread Kasa V Varun Tej
Hi WC,

I'm triggering the job as the root user and I want to run some command on the
edge node.
Yes, I made sure of the permissions.

Thanks,
Kasa



On Mon, Aug 12, 2013 at 3:07 PM, Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:

> Hi Kasa,
>
> did you create the oozie user on the target SSH server, and does it have
> all the user rights to execute what it should on the target server?
>
> Regards,
> Wellington.
>
>
> 2013/8/12 Kasa V Varun Tej 
>
>> Folks,
>>
>> I have been working on the Oozie SSH action for the past 2 days and I'm unable
>> to implement anything using the SSH action. I'm facing some permission issues,
>> so I thought that if someone could provide me with some information on how it
>> actually works, it might help me debug the issues I'm facing.
>>
>> The task I want to perform is to read a file on a particular node and push
>> those values to an email action.
>>
>> Thanks,
>> Kasa
>>
>>
>>
>


Re: How to import custom Python module in MapReduce job?

2013-08-12 Thread Andrei
Hi Binglin,

thanks for your explanation, now it makes sense. However, I'm not sure how
to implement the suggested method.

First of all, I found out that the `-cacheArchive` option is deprecated, so I
had to use `-archives` instead. I put my `lib.py` into a directory `lib` and
then zipped it to `lib.zip`. After that I uploaded the archive to HDFS and
linked it in the call to the Streaming API as follows:

  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar  -files main.py
*-archives hdfs://hdfs-namenode/user/me/lib.jar* -mapper "./main.py map"
-reducer "./main.py reduce" -combiner "./main.py combine" -input input
-output output

But the script failed, and from the logs I see that lib.jar hasn't been unpacked.
What am I missing?




On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang  wrote:

> Hi,
>
> The problem seems to caused by symlink, hadoop uses file cache, so every
> file is in fact a symlink.
>
> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
> [root@master01 tmp]# ./main.py
> Traceback (most recent call last):
>   File "./main.py", line 3, in ?
> import lib
> ImportError: No module named lib
>
> This should be a python bug: when using import, it can't handle symlink
>
> You can try to use a directory containing lib.py and use -cacheArchive,
> so the symlink actually links to a directory, python may handle this case
> well.
>
> Thanks,
> Binglin
>
>
>
> On Mon, Aug 12, 2013 at 2:50 PM, Andrei  wrote:
>
>> (cross-posted from 
>> StackOverflow
>> )
>>
>> I have a MapReduce job defined in file *main.py*, which imports module
>> lib from file *lib.py*. I use Hadoop Streaming to submit this job to
>> Hadoop cluster as follows:
>>
>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>
>> -files lib.py,main.py
>> -mapper "./main.py map" -reducer "./main.py reduce"
>> -input input -output output
>>
>>  In my understanding, this should put both main.py and lib.py into 
>> *distributed
>> cache folder* on each computing machine and thus make module lib available
>> to main. But it doesn't happen - from log file I see, that files *are
>> really copied* to the same directory, but main can't import lib, throwing
>> *ImportError*.
>>
>> Adding current directory to the path didn't work:
>>
>> import sys
>> sys.path.append(os.path.realpath(__file__))import lib# ImportError
>>
>> though, loading module manually did the trick:
>>
>> import imp
>> lib = imp.load_source('lib', 'lib.py')
>>
>>  But that's not what I want. So why Python interpreter can see other .py 
>> files
>> in the same directory, but can't import them? Note, I have already tried
>> adding empty __init__.py file to the same directory without effect.
>>
>>
>>
>


Re: How to import custom Python module in MapReduce job?

2013-08-12 Thread Binglin Chang
Maybe you didn't specify a symlink name in your command line, so the symlink name
will just be lib.jar, and I am not sure how you would then import the lib module
in your main.py file.
Please try this:
put main.py and lib.py in the same archive file, e.g. app.zip
*-archives hdfs://hdfs-namenode/user/me/app.zip#app* -mapper "app/main.py
map" -reducer "app/main.py reduce"
in main.py:
import app.lib
or:
import .lib
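
For example, a hedged sketch of what main.py could look like under that layout
(whether the package-style import works likely depends on an __init__.py being
present in the archive; since lib.py is unpacked into the same directory as
main.py, putting that directory on sys.path and importing it directly should also
work):

# main.py packed inside app.zip -- hypothetical sketch for the "-archives ...#app" layout.
import os
import sys

# lib.py is unpacked next to this script, so make that directory importable even
# if the interpreter resolves the app symlink into the filecache directory.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
import lib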




On Mon, Aug 12, 2013 at 6:01 PM, Andrei  wrote:

> Hi Binglin,
>
> thanks for your explanation, now it makes sense. However, I'm not sure how
> to implement suggested method with.
>
> First of all, I found out that `-cachArchive` option is deprecated, so I
> had to use `-archives` instead. I put my `lib.py` to directory `lib` and
> then zipped it to `lib.zip`. After that I uploaded archive to HDFS and
>  linked it in call to Streaming API as follows:
>
>   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar  -files
> main.py *-archives hdfs://hdfs-namenode/user/me/lib.jar* -mapper
> "./main.py map" -reducer "./main.py reduce" -combiner "./main.py combine"
> -input input -output output
>
> But script failed, and from logs I see that lib.jar hasn't been unpacked.
> What am I missing?
>
>
>
>
> On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang wrote:
>
>> Hi,
>>
>> The problem seems to caused by symlink, hadoop uses file cache, so every
>> file is in fact a symlink.
>>
>> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
>> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
>> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
>> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
>> [root@master01 tmp]# ./main.py
>> Traceback (most recent call last):
>>   File "./main.py", line 3, in ?
>> import lib
>> ImportError: No module named lib
>>
>> This should be a python bug: when using import, it can't handle symlink
>>
>> You can try to use a directory containing lib.py and use -cacheArchive,
>> so the symlink actually links to a directory, python may handle this case
>> well.
>>
>> Thanks,
>> Binglin
>>
>>
>>
>> On Mon, Aug 12, 2013 at 2:50 PM, Andrei wrote:
>>
>>> (cross-posted from 
>>> StackOverflow
>>> )
>>>
>>> I have a MapReduce job defined in file *main.py*, which imports module
>>> lib from file *lib.py*. I use Hadoop Streaming to submit this job to
>>> Hadoop cluster as follows:
>>>
>>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>>
>>> -files lib.py,main.py
>>> -mapper "./main.py map" -reducer "./main.py reduce"
>>> -input input -output output
>>>
>>>  In my understanding, this should put both main.py and lib.py into 
>>> *distributed
>>> cache folder* on each computing machine and thus make module lib available
>>> to main. But it doesn't happen - from log file I see, that files *are
>>> really copied* to the same directory, but main can't import lib,
>>> throwing*ImportError*.
>>>
>>> Adding current directory to the path didn't work:
>>>
>>> import sys
>>> sys.path.append(os.path.realpath(__file__))import lib# ImportError
>>>
>>> though, loading module manually did the trick:
>>>
>>> import imp
>>> lib = imp.load_source('lib', 'lib.py')
>>>
>>>  But that's not what I want. So why Python interpreter can see other .py 
>>> files
>>> in the same directory, but can't import them? Note, I have already tried
>>> adding empty __init__.py file to the same directory without effect.
>>>
>>>
>>>
>>
>


when Standby Namenode is doing checkpoint, the Active NameNode is slow.

2013-08-12 Thread lei liu
When the Standby NameNode is doing a checkpoint and uploads the image file to the
Active NameNode, the Active NameNode becomes very slow. What could be the reason
the Active NameNode is slow?


Thanks,

LiuLei


What is the resolution for HADOOP-9346

2013-08-12 Thread Sathwik B P
Hi Guys,

I upgraded protoc to 2.5.0 in order to build another apache project.

Now I am not able to build hadoop-common trunk.

What is the resolution for HADOOP-9346?

regards,
sathwik


Re: How exactly does Oozie work internally?

2013-08-12 Thread Wellington Chevreuil
I had a similar issue before... I'm not sure, but I think in my case Oozie
was always connecting through SSH as the oozie user, even if I was running it
as a different user. If it's not a big effort for you, I would recommend that
you try creating the oozie user on your edge node and give it all the required
rights to perform your action.

Cheers,
Wellington.


2013/8/12 Kasa V Varun Tej 

> Hi WC,
>
> I'm triggering the job as the root user and I want to run some command on the
> edge node.
> Yes, I made sure of the permissions.
>
> Thanks,
> Kasa
>
>
>
> On Mon, Aug 12, 2013 at 3:07 PM, Wellington Chevreuil <
> wellington.chevre...@gmail.com> wrote:
>
>> Hi Kasa,
>>
>> did you create the oozie user on the target SSH server, and does it
>> have all the user rights to execute what it should on the target server?
>>
>> Regards,
>> Wellington.
>>
>>
>> 2013/8/12 Kasa V Varun Tej 
>>
>>> Folks,
>>>
>>> I have been working on the Oozie SSH action for the past 2 days. I'm
>>> unable to implement anything using the SSH action. I'm facing some permission
>>> issues, so I thought that if someone could provide me with some information
>>> on how it actually works, it might help me debug the issues I'm facing.
>>>
>>> The task I want to perform is to read a file on a particular node and push
>>> those values to an email action.
>>>
>>> Thanks,
>>> Kasa
>>>
>>>
>>>
>>
>


Re: What is the resolution for HADOOP-9346

2013-08-12 Thread Sathwik B P
It seems like the Hadoop builds are failing for the same reason:
https://builds.apache.org/job/Hadoop-Yarn-trunk/299/

Is there a fix coming soon?
Do we fall back on protoc 2.4.1?

On Mon, Aug 12, 2013 at 10:38 AM, Sathwik B P  wrote:

> Hi Guys,
>
> I upgraded protoc to 2.5.0 in order to build another apache project.
>
> Now I am not able to build hadoop-common trunk.
>
> What is the resolution for HADOOP-9346?
>
> regards,
> sathwik
>


Re: How to import custom Python module in MapReduce job?

2013-08-12 Thread Andrei
For some reason, using the -archives option leads to "Error in configuring
object" without any further information. However, I found out that the -files
option works pretty well for this purpose. I was able to run my example as
follows.

1. I put `main.py` and `lib.py` into an `app` directory.
2. In `main.py` I used `lib.py` directly, that is, the import statement is just

import lib

3. Instead of uploading to HDFS and using the -archives option, I just pointed
to the `app` directory in the -files option:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar *-files app* -mapper
"*app/*main.py map" -reducer "*app/*main.py reduce" -input input -output
output

It did the trick. Note that I tested with both CPython (2.6) and PyPy
(1.9), so I think it's quite safe to assume this approach is correct for Python
scripts.
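
For reference, a rough skeleton of the layout that worked for me; the map/reduce
bodies and the lib.parse helper below are placeholders, not the real job:

# app/main.py -- skeleton only; app/lib.py sits next to it and is shipped via "-files app".
import sys

import lib

def do_map():
    for line in sys.stdin:
        key, value = lib.parse(line)  # hypothetical helper in lib.py
        print "%s\t%s" % (key, value)

def do_reduce():
    for line in sys.stdin:
        print line.strip()  # placeholder pass-through reducer

if __name__ == "__main__":
    mode = sys.argv[1]  # "map" or "reduce", as on the command line
    if mode == "map":
        do_map()
    else:
        do_reduce()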

Thanks for your help, Binglin; without it I wouldn't have been able to figure it
out.




On Mon, Aug 12, 2013 at 1:12 PM, Binglin Chang  wrote:

> Maybe you doesn't specify symlink name in you cmd line, so the symlink
> name will be just lib.jar, so I am not sure how you import lib module in
> your main.py file.
> Please try this:
> put main.py lib.py in same jar file, e.g.  app.zip
> *-archives hdfs://hdfs-namenode/user/me/app.zip#app* -mapper "app/main.py
> map" -reducer "app/main.py reduce"
> in main.py:
> import app.lib
> or:
> import .lib
>
>


Re: Hosting Hadoop

2013-08-12 Thread alex bohr
I've had a good experience running a large Hadoop cluster on EC2 instances.
After almost a year we haven't had any significant downtime, just lost a
small number of datanodes.
I don't think EMR is an ideal solution if your cluster will be running 24/7.

But for running a large cluster, I don't see how it's more cost
efficient to run in the cloud than to own the hardware, and we're trying to
move off the cloud onto our own hardware.  Can I ask why you're looking to
move to the cloud?


On Fri, Aug 9, 2013 at 10:42 AM, Nitin Pawar wrote:

> check altiscale as well
>
>
> On Fri, Aug 9, 2013 at 3:05 AM, Dhaval Shah 
> wrote:
>
>> Thanks for the list Marcos. I will go through the slides/links. I think
>> that's helpful
>>
>> Regards,
>> Dhaval
>>
>>   --
>>  *From:* Marcos Luis Ortiz Valmaseda 
>> *To:* Dhaval Shah 
>> *Cc:* user@hadoop.apache.org
>> *Sent:* Thursday, 8 August 2013 4:50 PM
>> *Subject:* Re: Hosting Hadoop
>>
>> Well, it all depends, because many companies use Cloud Computing
>> platforms like Amazon EMR, VMware, and Rackspace Cloud for Hadoop
>> hosting:
>> http://aws.amazon.com/elasticmapreduce
>> http://www.vmware.com/company/news/releases/vmw-mapr-hadoop-062013.html
>> http://bitrefinery.com/services/hadoop-hosting
>> http://www.joyent.com/products/compute-service/features/hadoop
>>
>> There are a lot of companies using HBase hosted in the cloud. The last
>> HBaseCon was full of great use cases:
>> HBase at Pinterest:
>> http://www.hbasecon.com/sessions/apache-hbase-operations-at-pinterest/
>>
>> HBase at Groupon
>> http://www.hbasecon.com/sessions/apache-hbase-at-groupon/
>>
>> A great talk by Benoit for Networking design for HBase:
>>
>> http://www.hbasecon.com/sessions/scalable-network-designs-for-apache-hbase/
>>
>> Using Coprocessors to Index Columns in an Elasticsearch Cluster
>> http://www.hbasecon.com/sessions/using-coprocessors-to-index-columns/
>>
>> 2013/8/8, Dhaval Shah :
>> > We are exploring the possibility of hosting Hadoop outside of our data
>> > centers. I am aware that Hadoop in general isn't exactly designed to
>> run on
>> > virtual hardware. So a few questions:
>> > 1. Are there any providers out there who would host Hadoop on dedicated
>> > physical hardware?
>> > 2. Has anyone had success hosting Hadoop on virtualized hardware where
>> 100%
>> > uptime and performance/stability are very important (we use HBase as a
>> real
>> > time database and it needs to be up all the time)?
>> >
>> > Thanks,
>> > Dhaval
>>
>>
>> --
>> Marcos Ortiz Valmaseda
>> Product Manager at PDVSA
>> http://about.me/marcosortiz
>>
>>
>>
>
>
> --
> Nitin Pawar
>


How to tune fileSystem.listFiles("/", true) if you want to walk through almost all files

2013-08-12 Thread Christian Schneider
Hi, is there a way to tune this?

I walk through the files with:

RemoteIterator<LocatedFileStatus> listFiles = fileSystem.listFiles(new
Path(uri), true);
while (listFiles.hasNext()) {
listFiles.next();
}

I need to get some information about those files, therefore I'd like to scan
them all.

Is there any way to tune the listFiles.next() call, like loading a bunch of
files at once, or multithreading it?

Best Regards,
Christian.


Re: Jobtracker page hangs ..again.

2013-08-12 Thread Patai Sangbutsarakum
OK, after some sweat, I think I found the root cause, but I still need another
team to help me fix it.
It lies in the DNS.  For each of the tip:task lines, in the tcpdump
I saw a DNS query to the DNS server. The timestamps seem to match.

2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_00 has split on node:/rack1/host1

127 ms

2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_00 has split on node:/rack1/host2

126 ms

2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
tip:task_201308111631_0006_m_00 has split on node:/rack2/host3


20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)

20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ ? host1. (37)

20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)

20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)


20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)

20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ ? host2. (37)

20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy (189)

20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)


20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)

20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ ? host3. (37)

20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz (189)

I looked at the jobtracker log in another datacenter when the same job was
submitted. The timestamps on the tip:task lines there are much, much closer
together.

The question that arises here is whether job initialization really queries
the DNS, and if so, whether there is any way to suppress that. The topology file
is already in place, with both names and IPs in it.


Hope this makes sense.

Patai




On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
silvianhad...@gmail.com> wrote:

> Appreciate your input Bryant, i will try to reproduce and see the namenode
> log before, while, and after it pause.
> Wish me luck
>
>
> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
>> When I've had problems with a slow jobtracker, i've found the issue to be
>> one of the following two (so far) possibilities:
>>
>> - long GC pause (I'm guessing this is not it based on your email)
>> - hdfs is slow
>>
>> I haven't dived into the code yet, but circumstantially I've found that
>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>> larger and larger jobs, aside form the size of the splits serialization in
>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>> your jobtracker.  This affects other jobs being able to submit, as well as
>> the 50030 web ui.
>>
>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>> do you see a corresponding pause in the namenode logs?  What gets spewed
>> before and after that pause?
>>
>>
>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>> silvianhad...@gmail.com> wrote:
>>
>>> A while back, i was fighting with the jobtracker page hangs when i
>>> browse to http://jobtracker:50030 browser doesn't show jobs info as
>>> usual which ends up because of allowing too much job history kept in
>>> jobtracker.
>>>
>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>> around 6k-ish mappers)
>>> Job initialization was success, and finished the job.
>>>
>>> However, before the job is actually running, i can't access to the
>>> jobtracker page anymore same symptom as above.
>>>
>>> i see bunch of this in jobtracker log
>>>
>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>> ..
>>> ..
>>> ..
>>>
>>> Until i see this
>>>
>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>> LOCALITY_WAIT_FACTOR=1.0
>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
>>> job_201307291733_0619 initialized successfully with 76797 map tasks and 10
>>> reduce tasks.
>>>
>>> that's when i can access to the jobtracker page again.
>>>
>>>
>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>> 1ish gig from 40
>>> network bandwidth is far from filled up.
>>>
>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>> Environment (build 1.6.0_32-b05)
>>>
>>>
>>> What would be the root cause i should looking at or at least where to
>>> start?
>>>
>>> Thanks you in advanced
>>>
>>>
>>>
>>>
>>
>


Re: Jobtracker page hangs ..again.

2013-08-12 Thread Patai Sangbutsarakum
Update: after adjusting the network routing, DNS query times are in microseconds,
as they are supposed to be. The issue is completely solved.
The jobtracker page doesn't hang anymore when launching a 100k-mapper job.

Cheers,



On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum <
silvianhad...@gmail.com> wrote:

> Ok, after some sweat, i think I found the root cause but still need
> another team to help me fix it.
> It lines on the DNS.  Somehow each of the tip:task line, through the
> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
>
> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_00 has split on node:/rack1/host1
>
> 127 ms
>
> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_00 has split on node:/rack1/host2
>
> 126 ms
>
> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201308111631_0006_m_00 has split on node:/rack2/host3
>
>
> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>
> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ ? host1. (37)
>
> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>
> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
>
>
> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>
> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ ? host2. (37)
>
> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
> (189)
>
> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>
>
> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>
> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ ? host3. (37)
>
> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
> (189)
>
> I looked at the jobtracker log in other datacenter when submitted with the
> same query. Timestamp in each tip:task line is much much faster.
>
> The question that raise here is the job initialization is really request
> the DNS, if so is there any way to suppress that. topology file is already
> in place where name and ip are already there.
>
>
> Hope this make sense
>
> Patai
>
>
>
>
> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum <
> silvianhad...@gmail.com> wrote:
>
>> Appreciate your input Bryant, i will try to reproduce and see the
>> namenode log before, while, and after it pause.
>> Wish me luck
>>
>>
>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault <
>> bbeaudrea...@hubspot.com> wrote:
>>
>>> When I've had problems with a slow jobtracker, i've found the issue to
>>> be one of the following two (so far) possibilities:
>>>
>>> - long GC pause (I'm guessing this is not it based on your email)
>>> - hdfs is slow
>>>
>>> I haven't dived into the code yet, but circumstantially I've found that
>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>> larger and larger jobs, aside form the size of the splits serialization in
>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>> the 50030 web ui.
>>>
>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>> before and after that pause?
>>>
>>>
>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum <
>>> silvianhad...@gmail.com> wrote:
>>>
 A while back, i was fighting with the jobtracker page hangs when i
 browse to http://jobtracker:50030 browser doesn't show jobs info as
 usual which ends up because of allowing too much job history kept in
 jobtracker.

 Currently, i am setting up a new cluster 40g heap on the namenode and
 jobtracker in dedicated machines. Not fun part starts here; a developer
 tried to test out the cluster by launching a 76k map job (the cluster has
 around 6k-ish mappers)
 Job initialization was success, and finished the job.

 However, before the job is actually running, i can't access to the
 jobtracker page anymore same symptom as above.

 i see bunch of this in jobtracker log

 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
 tip:task_201307291733_0619_m_076796 has split on node: /rack/node
 ..
 ..
 ..

 Until i see this

 INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
 LOCALITY_WAIT_FACTOR=1.0
 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
 Job job_201307291733_0619 initialized successfully with 76797 map tasks and
 10 reduce tasks.

 that's when i can access to the jobtracker page again.


 CPU on jobtracker is very little load, JTK's Heap is far from full like
 1ish gig from 40
 network bandwidth is far from filled up.

 I'm runn

Re: How to tune fileSystem.listFiles("/", true) if you want to walk through almost all files

2013-08-12 Thread Christian Schneider
Hi, I found out that it works much faster with fileSystem.listStatus() and
recursion by hand.

listFiles  = 4021 files in 14.27 s
listStatus = 4021 files in 364.3 ms

Currently I have just tested it on localhost. Tomorrow I'll check it against the
cluster.

import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import com.google.common.base.Stopwatch; // Guava

public class Main
{
    static AtomicInteger count = new AtomicInteger();

    static URI uri;
    static FileSystem fileSystem;

    public static void main(final String... args) throws URISyntaxException, IOException, InterruptedException
    {
        uri = new URI("/home/christian/Documents");
        fileSystem = FileSystem.get(uri, new Configuration(), "hdfs");

        Stopwatch stopwatch = new Stopwatch();

        // First pass: recursive listFiles() through a RemoteIterator.
        stopwatch.start();
        withListAllAndNext();
        stopwatch.stop();
        System.out.println(count + " - " + stopwatch);

        stopwatch.reset();
        count.set(0);

        // Second pass: listStatus() with recursion by hand.
        stopwatch.start();
        blockwiseWithRecursion(fileSystem.listStatus(new Path(uri)));
        stopwatch.stop();
        System.out.println(count + " - " + stopwatch);
    }

    private static void blockwiseWithRecursion(FileStatus... listLocatedStatus) throws FileNotFoundException, IOException
    {
        for (FileStatus fileStatus : listLocatedStatus)
        {
            if (fileStatus.isDirectory())
                blockwiseWithRecursion(fileSystem.listStatus(fileStatus.getPath()));
            else
                count.incrementAndGet();
        }
    }

    private static void withListAllAndNext() throws FileNotFoundException, IOException
    {
        RemoteIterator<LocatedFileStatus> listFiles = fileSystem.listFiles(new Path(uri), true);
        while (true)
        {
            try
            {
                LocatedFileStatus next = listFiles.next();
                count.incrementAndGet();
            }
            catch (IOException e)
            {
                System.err.println(e.getMessage());
            }
            catch (NoSuchElementException e)
            {
                break;
            }
        }
    }
}


2013/8/12 Christian Schneider 

> Hi, is there a way to tune this?
>
> I walk though the files with:
>
> RemoteIterator<LocatedFileStatus> listFiles = fileSystem.listFiles(new
> Path(uri), true);
> while (listFiles.hasNext()) {
> listFiles.next();
> }
>
> I need to get some information about those files, therefore i like to scan
> them all.
>
> Is there any way to tune the listFiles.next() call. Like loading a bunch
> of files, or multithreading it?
>
> Best Regards,
> Christian.
>


Re: Hardware Selection for Hadoop

2013-08-12 Thread Sambit Tripathy
Any rough idea how much this would cost? Actually I kind of need budget
approval and have to put some rough figures in $ on paper. Help!

1. 6 x 2 TB hard disks (JBOD), 2 quad-core CPUs, 24-48 GB RAM
2. 1 rack mount unit
3. 1 GbE switch for the rack
4. 10 GbE switch for the network

Regards,
Sambit Tripathy.


On Tue, May 7, 2013 at 9:21 PM, Ted Dunning  wrote:

>
> On Tue, May 7, 2013 at 5:53 AM, Michael Segel 
> wrote:
>
>> While we have a rough metric on spindles to cores, you end up putting a
>> stress on the disk controllers. YMMV.
>>
>
> This is an important comment.
>
> Some controllers fold when you start pushing too much data.  Testing nodes
> independently before installation is important.
>
>


Re: Hardware Selection for Hadoop

2013-08-12 Thread Chris Embree
As we always say in Technology... it depends!

What country are you in?  That makes a difference.
How much buying power do you have?  I work for a Fortune 100 Company and we
-- absurdly -- pay about 60% off retail when we buy servers.
Are you buying a bunch at once?

Your best bet is to contact 3 or 4 VARs to get quotes.  They'll offer you
add-on services, like racking, cabling, configuring servers, etc.  You can
decide if it's worth it.

The bottom line, there is no correct answer to your question. ;)


On Mon, Aug 12, 2013 at 8:30 PM, Sambit Tripathy  wrote:

> Any rough ideas how much this would cost? Actually I kinda require a
> budget approval and need to put some rough figures in $ on the paper. Help!
>
> 1. 6 x 2 TB hard disks (JBOD), 2 quad-core CPUs, 24-48 GB RAM
> 2. 1 rack mount unit
> 3. 1 GbE switch for the rack
> 4. 10 GbE switch for the network
>
> Regards,
> Sambit Tripathy.
>
>
> On Tue, May 7, 2013 at 9:21 PM, Ted Dunning  wrote:
>
>>
>> On Tue, May 7, 2013 at 5:53 AM, Michael Segel 
>> wrote:
>>
>>> While we have a rough metric on spindles to cores, you end up putting a
>>> stress on the disk controllers. YMMV.
>>>
>>
>> This is an important comment.
>>
>> Some controllers fold when you start pushing too much data.  Testing
>> nodes independently before installation is important.
>>
>>
>


Re: Hardware Selection for Hadoop

2013-08-12 Thread Sambit Tripathy
I understand.

But sometimes there is a lock-in with a particular vendor, and you are not
allowed to put servers inside the corporate data center if they are
procured from another vendor.

The idea is to start from a basic setup and then grow. You can tell me some
numbers in $ if you have them, preferably ;) I know sometimes there are no
correct answers.

I got a quote of $4200 for 6 x 2 TB hard disks (JBOD), 2 quad cores, and 24-48 GB
RAM. Vendor: HP. Does this sound OK for this configuration?


On Tue, Aug 13, 2013 at 6:15 AM, Chris Embree  wrote:

> As we always say in Technology... it depends!
>
> What country are you in?  That makes a difference.
> How much buying power do you have?  I work for a Fortune 100 Company and
> we -- absurdly -- pay about 60% off retail when we buy servers.
> Are you buying a bunch at once?
>
> Your best bet is to contact 3 or 4 VARs to get quotes.  They'll offer you
> add-on services, like racking, cabling, configuring servers, etc.  You can
> decide if it's worth it.
>
> The bottom line, there is no correct answer to your question. ;)
>
>
> On Mon, Aug 12, 2013 at 8:30 PM, Sambit Tripathy wrote:
>
>> Any rough ideas how much this would cost? Actually I kinda require a
>> budget approval and need to put some rough figures in $ on the paper. Help!
>>
>> 1. 6 x 2 TB hard disks (JBOD), 2 quad-core CPUs, 24-48 GB RAM
>> 2. 1 rack mount unit
>> 3. 1 GbE switch for the rack
>> 4. 10 GbE switch for the network
>>
>> Regards,
>> Sambit Tripathy.
>>
>>
>> On Tue, May 7, 2013 at 9:21 PM, Ted Dunning wrote:
>>
>>>
>>> On Tue, May 7, 2013 at 5:53 AM, Michael Segel >> > wrote:
>>>
 While we have a rough metric on spindles to cores, you end up putting a
 stress on the disk controllers. YMMV.

>>>
>>> This is an important comment.
>>>
>>> Some controllers fold when you start pushing too much data.  Testing
>>> nodes independently before installation is important.
>>>
>>>
>>
>


Re: Jobtracker page hangs ..again.

2013-08-12 Thread Harsh J
If you're not already doing it, run a local name caching daemon (such
as nscd) on each cluster node. Hadoop does a lot of lookups, and
a local cache goes a long way in reducing the load on your DNS.

On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum
 wrote:
> Update, after adjust the network routing, dns query speed is in micro sec as
> suppose to be. the issue is completely solve.
> Jobtracker page doesn't hang anymore when launch 100k mappers job..
>
> Cheers,
>
>
>
> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum
>  wrote:
>>
>> Ok, after some sweat, i think I found the root cause but still need
>> another team to help me fix it.
>> It lines on the DNS.  Somehow each of the tip:task line, through the
>> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
>>
>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_00 has split on node:/rack1/host1
>>
>> 127 ms
>>
>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_00 has split on node:/rack1/host2
>>
>> 126 ms
>>
>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_00 has split on node:/rack2/host3
>>
>>
>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>>
>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ ? host1. (37)
>>
>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>>
>> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
>>
>>
>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>>
>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ ? host2. (37)
>>
>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
>> (189)
>>
>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>>
>>
>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>>
>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ ? host3. (37)
>>
>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
>> (189)
>>
>> I looked at the jobtracker log in other datacenter when submitted with the
>> same query. Timestamp in each tip:task line is much much faster.
>>
>> The question that raise here is the job initialization is really request
>> the DNS, if so is there any way to suppress that. topology file is already
>> in place where name and ip are already there.
>>
>>
>> Hope this make sense
>>
>> Patai
>>
>>
>>
>>
>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum
>>  wrote:
>>>
>>> Appreciate your input Bryant, i will try to reproduce and see the
>>> namenode log before, while, and after it pause.
>>> Wish me luck
>>>
>>>
>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
>>>  wrote:

 When I've had problems with a slow jobtracker, i've found the issue to
 be one of the following two (so far) possibilities:

 - long GC pause (I'm guessing this is not it based on your email)
 - hdfs is slow

 I haven't dived into the code yet, but circumstantially I've found that
 when you submit a job the jobtracker needs to put a bunch of files in hdfs,
 such as the job.xml, the job jar, etc.  I'm not sure how this scales with
 larger and larger jobs, aside form the size of the splits serialization in
 the job.xml, but if your HDFS is slow for any reason it can cause pauses in
 your jobtracker.  This affects other jobs being able to submit, as well as
 the 50030 web ui.

 I'd take a look at your namenode logs.  When the jobtracker logs pause,
 do you see a corresponding pause in the namenode logs?  What gets spewed
 before and after that pause?


 On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum
  wrote:
>
> A while back, i was fighting with the jobtracker page hangs when i
> browse to http://jobtracker:50030 browser doesn't show jobs info as usual
> which ends up because of allowing too much job history kept in jobtracker.
>
> Currently, i am setting up a new cluster 40g heap on the namenode and
> jobtracker in dedicated machines. Not fun part starts here; a developer
> tried to test out the cluster by launching a 76k map job (the cluster has
> around 6k-ish mappers)
> Job initialization was success, and finished the job.
>
> However, before the job is actually running, i can't access to the
> jobtracker page anymore same symptom as above.
>
> i see bunch of this in jobtracker log
>
> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
> ..
> ..
> ..
>
> Until i see this
>
> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
> LOCALITY_WAIT_FACTOR=1.0
> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
> Job job_201307291733_0619 initialized su

Re: when Standby Namenode is doing checkpoint, the Active NameNode is slow.

2013-08-12 Thread Harsh J
How large are your checkpointed fsimage files?

On Mon, Aug 12, 2013 at 3:42 PM, lei liu  wrote:
> When the Standby NameNode is doing a checkpoint and uploads the image file to
> the Active NameNode, the Active NameNode becomes very slow. What could be the
> reason the Active NameNode is slow?
>
>
> Thanks,
>
> LiuLei
>



-- 
Harsh J


Re: Unable to load native-hadoop library for your platform

2013-08-12 Thread Harsh J
If you use tarballs, you will need to build native libraries for your
OS. Follow instructions for native libraries under src/BUILDING.txt of
a source tarball.

Alternatively, pick up Apache Hadoop packages from Apache Bigtop for
2.0.5-alpha and they'll come with pre-built, proper native libraries.

On Mon, Aug 12, 2013 at 12:18 PM, 小鱼儿  wrote:
> I use hadoop-2.0.5-alpha on Red Hat Enterprise Linux Server release 5.4 and have
> the following warning:
> 13/08/12 13:58:57 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> How do I resolve this?



-- 
Harsh J


Re: when Standby Namenode is doing checkpoint, the Active NameNode is slow.

2013-08-12 Thread lei liu
The fsimage file size is 1658934155 bytes (about 1.5 GB).


2013/8/13 Harsh J 

> How large are your checkpointed fsimage files?
>
> On Mon, Aug 12, 2013 at 3:42 PM, lei liu  wrote:
> > When Standby Namenode is doing checkpoint,  upload the image file to
> Active
> > NameNode, the Active NameNode is very slow. What is reason result to the
> > Active NameNode is slow?
> >
> >
> > Thanks,
> >
> > LiuLei
> >
>
>
>
> --
> Harsh J
>