Re: How to setup Hive on a single node ?

2012-02-10 Thread Lac Trung
Thanks for your reply !

I've already installed Hive correctly.

First, I installed CDH3
(https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation).
Unfortunately I use Ubuntu Oneiric and CDH doesn't support Oneiric, so I
downloaded and installed the CDH3 packages for the Lucid system.

Then, I installed Hadoop via the following command: *sudo apt-get
install hadoop-0.20*

Next, I installed Hive via the following command: *sudo apt-get
install hadoop-hive*
After that, I installed MySQL
(http://ariejan.net/2007/12/12/how-to-install-mysql-on-ubuntudebian).

Finally, I configured Hive
(https://ccp.cloudera.com/display/CDHDOC/Hive+Installation) and some
variables
(https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration)
in conf/hive-site.xml:

+ Hive - Hadoop: hadoop.bin.path, hadoop.config.dir
+ Hive - MySQL: hive.metastore.warehouse.dir

Now it runs correctly.

Thanks so much !


2012/2/9 hadoop hive hadooph...@gmail.com

 hey Lac,

 It's showing that you don't have the DBS table in your metastore (Derby or
 MySQL); you have to install Hive again or build Hive again through
 Ant.

 Check your metastore (whether the DBS table exists or not).
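
 For example, here is a quick sketch of such a check from outside Hive (the
 JDBC URL, user and password are placeholders for your own MySQL metastore
 settings, and the MySQL Connector/J jar must be on the classpath):

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.ResultSet;

 public class MetastoreCheck {
   public static void main(String[] args) throws Exception {
     Class.forName("com.mysql.jdbc.Driver"); // Connector/J on the classpath
     Connection conn = DriverManager.getConnection(
         "jdbc:mysql://localhost:3306/metastore", "hiveuser", "hivepass");
     // Ask the JDBC metadata whether a table named DBS exists at all.
     ResultSet rs = conn.getMetaData().getTables(null, null, "DBS", null);
     System.out.println(rs.next() ? "DBS table found" : "DBS table missing");
     conn.close();
   }
 }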

 Thanks & regards
 Vikas Srivastava

 On Fri, Feb 10, 2012 at 8:33 AM, Lac Trung trungnb3...@gmail.com wrote:

  Thanks for your reply !
 
  I think I installed Hadoop correctly, because when I run the wordcount
 example I get the correct output. But I didn't know how to install Hive, so
 I installed Hive via
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted, which
 includes installing Hadoop 0.20 (maybe not on a single node)  ^_^
  I configured hive-site.xml as the instructions describe, but I get an error
 like this:
 
  
 
  hive> show tables;
  FAILED: Error in metadata: javax.jdo.JDODataStoreException: Required
  table missing : `DBS`
  in Catalog "" Schema "".
  DataNucleus requires this table to perform its persistence operations.
  Either your MetaData
  is incorrect,
  or you need to enable datanucleus.autoCreateTables
  NestedThrowables:
  org.datanucleus.store.rdbms.exceptions.MissingTableException: Required
  table missing : `DBS`
  in Catalog "" Schema "".
  DataNucleus requires this table to perform its persistence operations.
  Either your MetaData
  is incorrect,
  or you need to enable datanucleus.autoCreateTables
  FAILED: Execution Error, return code 1 from
  org.apache.hadoop.hive.ql.exec.DDLTask
  
 
  I didn't know what to do, so I reinstalled Ubuntu to start from scratch, and
  I hope that someone can show me the way to do it.
 
  --
  Lạc Trung
 




-- 
Lạc Trung
20083535


Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Rob Stewart
I'm looking to clarify the relationship between
MultithreadedMapper.setNumberOfThreads(i) and
mapreduce.tasktracker.map.tasks.maximum .

If I set:
- MultithreadedMapper.setNumberOfThreads( 4 )
- mapreduce.tasktracker.map.tasks.maximum = 1

Will 4 map tasks be executed in four separate threads within one JVM ?
Or are the number of threads also restricted by the map.tasks.maximum
parameter?

What about if I set:
- MultithreadedMapper.setNumberOfThreads( 4 )
- mapreduce.tasktracker.map.tasks.maximum = 4

Will this mean that 4 map tasks are executed in 4 threads in one JVM,
or will it mean that 4 JVMs be instantiated, each executing 4 map
tasks in individual threads?

thanks,


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Harsh J
Hi Rob,

On Fri, Feb 10, 2012 at 5:55 PM, Rob Stewart robstewar...@gmail.com wrote:
 I'm looking to clarify the relationship between
 MultithreadedMapper.setNumberOfThreads(i) and
 mapreduce.tasktracker.map.tasks.maximum .

The former is an in-user-application value that controls the total
number of threads to run for map() calls (inside a mapper). This is
_inside_ one JVM (a task, in hadoop terms, is one complete JVM running
user code).

The latter controls, at a TaskTracker level, the max total number of
map-task JVMs that it can run concurrently at any given time.
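
For concreteness, a minimal driver sketch (the input/output paths are
placeholders, and it reuses Hadoop's stock TokenCounterMapper purely as an
example delegate) showing where the per-job knob lives; the tasktracker slot
maximum is a daemon-side setting in mapred-site.xml, not something the job
itself changes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MtWordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "mt-wordcount");
    job.setJarByClass(MtWordCount.class);

    // The job's mapper is MultithreadedMapper; the real map logic is the
    // delegate class handed to setMapperClass().
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, TokenCounterMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 4); // 4 map() threads per map task JVM

    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With mapreduce.tasktracker.map.tasks.maximum = 1 on a TaskTracker, you would
get at most one such JVM on that node at a time, each running 4 map() threads.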

 What about if I set:
 - MultithreadedMapper.setNumberOfThreads( 4 )
 - mapreduce.tasktracker.map.tasks.maximum = 4

 Will this mean that 4 map tasks are executed in 4 threads in one JVM,
 or will it mean that 4 JVMs be instantiated, each executing 4 map
 tasks in individual threads?

4 JVMs if you have 4 tasks in your Job  (# of map tasks of a job is
dependent on its input).

Each JVM will then run the MultithreadedMapper code, which will then
run 4 threads to call your map() inside of it, because you've asked that
of it.

-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Rob Stewart
hi Harsh,

On 10 February 2012 12:42, Harsh J ha...@cloudera.com wrote:

 4 JVMs if you have 4 tasks in your Job  (# of map tasks of a job is
 dependent on its input).

 Each JVM will then run the MultithreadedMapper code, which will then
 run 4 threads to call your map() inside of it cause you've asked that
 of it.

So... the MultithreadedMapper class splits *one* map task into N
threads? How is this achieved? I wasn't aware that a map task could
be implicitly sub-divided. I was under the (false?)
impression that the purpose of a MultithreadedMapper was to enable the
opportunity to send N independent map tasks to be forked as
threads?

Also, from what you say.. if you have map.tasks.maximum = 4 and
setNumberOfThreads(4), then in all, for each compute node, up to 16
threads could be forked at any one time?

I'm trying to identify the performance penalty or performance benefit
of achieving node concurrency with threads, rather than multiple JVMs,
and I was hoping that by setting map.tasks.maximum = 1 and
setNumberOfThreads( #cores ), I would achieve that. Maybe not?

thanks,

--
Rob


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Harsh J
Rob,

On Fri, Feb 10, 2012 at 6:32 PM, Rob Stewart robstewar...@gmail.com wrote:
 So.. the MultithreadedMapper class splits *one* map task into N number
 of threads? How is this achieved? I wasn't aware that a map task could
 be implicitly sub-divided implicitly? I was under the (false?)
 impression that the purpose of a MultithreadedMapper enabled the
 opportunity to send N number of independent map tasks to be forked as
 threads. ?

Imagine writing your own Mapper code that runs threads to do some
processing when beginning the map() process. MultithreadedMapper is
just an abstraction of something like that, provided for developer
convenience. It makes no relationship with task, task scheduling, or
any other thing higher up in the framework. Does that make it clear?
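
To make that concrete, here is a rough hand-rolled sketch (not the actual
MultithreadedMapper source; the pool size, types and class name are arbitrary)
of a mapper that runs its own worker threads, which is roughly the kind of
thing MultithreadedMapper packages up for you:

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HandRolledThreadedMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private ExecutorService pool;

  @Override
  protected void setup(Context context) {
    pool = Executors.newFixedThreadPool(4); // worker count picked arbitrarily
  }

  @Override
  protected void map(LongWritable key, Text value, final Context context) {
    final String line = value.toString(); // reads stay on the main task thread
    pool.submit(new Runnable() {
      public void run() {
        StringTokenizer itr = new StringTokenizer(line);
        try {
          while (itr.hasMoreTokens()) {
            synchronized (context) { // Context is shared, so serialize writes
              context.write(new Text(itr.nextToken()), new IntWritable(1));
            }
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
  }

  @Override
  protected void cleanup(Context context) throws InterruptedException {
    pool.shutdown();                          // drain queued work before the
    pool.awaitTermination(1, TimeUnit.HOURS); // task is allowed to finish
  }
}

Note that in the real MultithreadedMapper the record reading stays single
threaded and only the map() calls are fanned out across the threads.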

 Also, from what you say.. if you have map.tasks.maximum = 4 and
 setNumberOfThreads(4), then in all, for each compute node, up to 16
 threads could be forked at any one time?

Yeah you'd be running, at maximum, 4 JVMs, each with 4 threads inside it.

 I'm trying to identify the performance penalty or performance benefit
 of achieving node concurrency with threads, rather than multiple JVMs.
 I and I was hoping that setting map.tasks.maximum = 1, and
 setNumberOfThreads( #cores ), I would achieve that. Maybe not?

What you're missing to see here is that the multithreaded mapper is
something that runs as part of one single map task.

Each map task has a defined input split from which it reads off keys
and values to map() calls.

With just one JVM slot, you'd end up processing only one input-chunk
at a time, though with 4 threads doing map() computation, while with
four slots, you may be processing 4 input-chunks (4 tasks) at the same
time. The choice between the two has to be application-sensitive.

If your work were IO intensive, the slot approach would win at
parallelism. Using a single slot with 4 threads when the map()
computation is cheap would be a waste of time that you could instead
spend doing more IO with parallel tasks.

But if your work were more CPU intensive, where each map() may take a
long time to run before moving to next, then MTMapper with a set
amount of threads may make more sense to use.

-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Rob Stewart
Harsh,

On 10 February 2012 13:33, Harsh J ha...@cloudera.com wrote:

 What you're missing to see here is that the multithreaded mapper is
 something that runs as part of one single map task.



 With just one JVM slot, you'd end up processing only one input-chunk
 at a time, though with 4 threads doing map() computation, while with
 four slots, you may be processing 4 input-chunks (4 tasks) at the same
 time. The choice between the two has to be application-sensitive.

OK, take word count. The <k,v> to the map is <null, "foo bar lambda
beta">. The canonical Hadoop program would tokenize this line of text
and output <foo,1> and so on. How would the MultithreadedMapper know
how to further divide this line of text into, say: [<null, "foo
bar">, <null, "lambda beta">] for 2 threads to run in parallel? Can you
somehow provide an additional record reader to split the input to the
map task into sub-inputs for each thread?

 If your work were IO intensive, the slot approach would win at
 parallelism.

Are you saying here that 4 single-threaded OS processes can achieve a
higher rate of OS IO, than 4 threads within one OS process doing IO
(which would sound sensible if that's the case).

 Using single slot with 4 threads when the map()
 computation is cheap would be a waste of time you could instead do
 more IO with parallel tasks.

The argument against this approach is that the cost of starting up OS
processes is far more expensive than forking threads within processes.
So I would have said the contrary - where map tasks are small and the
input size is large, then many JVMs would be instantiated throughout
the system, one per task. Instead, one might speculate that reducing
the number of JVMs, replacing them with lower-latency thread forking,
would improve runtime speeds?

 But if your work were more CPU intensive, where each map() may take a
 long time to run before moving to next, then MTMapper with a set
 amount of threads may make more sense to use.

OK, so are you saying:
- For CPU intensive tasks, multiple threads might help
- For IO intensive tasks, multiple OS processes achieve higher
throughput than multiple threads within a smaller number of OS
processes?

Thanks,

--
Rob


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Harsh J
Hello again,

On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The k,v to the map is null,foo bar lambda
 beta. The canonical Hadoop program would tokenize this line of text
 and output foo,1 and so on. How would the multithreadedmapper know
 how to further divide this line of text into, say: [null,foo
 bar,null,lambda beta] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

In MultithreadedMapper, the IO work is still single threaded, while
the map() calling post-read is multithreaded. But yes you could use a
mix of CombineFileInputFormat and some custom logic to have multiple
local splits per map task, and divide readers of them among your
threads. But why do all this when that's what slots at the TT are for?
The cost of a single map task failure with your mammoth task approach
would also be higher - more work to repeat.

 Are you saying here that 4 single-threaded OS processes can achieve a
 higher rate of OS IO, than 4 threads within one OS process doing IO
 (which would sound sensible if that's the case).

Yeah, that's what I meant, but with the earlier point of "In
MultithreadedMapper, the IO work is still single threaded"
specifically in mind.

 The argument against this approach is that the cost starting up OS
 processes is far more expensive that forking threads within processes.
 So I would have said the contrary - where map tasks are small and
 input size is large, than many JVMs would be instantiated throughout
 the system, one per task. Instead, one might speculate that reducing
 the number of JVMs, replacing with lower latency thread forking would
 improve runtime speeds. ?

Agreed here.

The JVM startup overhead does exist, but I wouldn't think it's too high
a cost overall, given the simple benefits it can provide instead.
There is also JVM reuse which makes sense to use for CPU intensive
applications, so you can take advantage of the HotSpot features of the
JVM as it gets reused for running tasks of the same job.

 OK, so are you saying:
 - For CPU intensive tasks, multiple threads might help
 - For IO intensive tasks, multiple OS processes achieve higher
 throughput than multiple threads within a smaller number of OS
 processes?

Yep, but also if you limit your total slots to 1 in favor of going all
for multi-threading, you won't be able to smoothly run multiple jobs
at the same time. Tasks from new jobs may have to wait longer to run,
while in regular slotted environments this is easier to achieve.

-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Hadoop 0.21.0 streaming giving no status information

2012-02-10 Thread Patrick Donnelly
Hi,

I'm trying to upgrade an application previously written for Hadoop
0.20.0 to 0.21.0. I'm running into an issue where the status output is
missing, which makes it difficult to get the job ID/success status:

hadoop/bin/hadoop jar
hadoop/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -D
mapreduce.job.reduces=0 -input file:///dev/null -mapper ./cmd.sh -file
./cmd.sh -output '/users/foo/tmp/job-1234' -verbose > OUTPUT 2>&1

This gives me a bunch of settings output such as:

STREAM: net.topology.script.number.args=100
STREAM: s3.blocksize=67108864
STREAM: s3.bytes-per-checksum=512
STREAM: s3.client-write-packet-size=65536
STREAM: s3.replication=3
STREAM: s3.stream-buffer-size=4096

finally ending with:

STREAM: webinterface.private.actions=false
STREAM: 
STREAM: submitting to jobconf:machine.hostname.domain:8023

After that, I get no further status information. The job does complete
successfully. I would expect to get this type of status information:

11/04/23 01:03:24 INFO streaming.StreamJob: getLocalDirs():
[/home/hadoop/hadoop/tmp/dir/hadoop-hadoop/mapred/local]

11/04/23 01:03:24 INFO streaming.StreamJob: Running job: job_201104222325_0021

11/04/23 01:03:24 INFO streaming.StreamJob: To kill this job, run:

11/04/23 01:03:24 INFO streaming.StreamJob:
/home/hadoop/hadoop/bin/../bin/hadoop job
-Dmapred.job.tracker=localhost:54311 -kill job_201104222325_0021

11/04/23 01:03:24 INFO streaming.StreamJob: Tracking URL:
http://localhost:50030/jobdetails.jsp?jobid=job_201104222325_0021

11/04/23 01:03:25 INFO streaming.StreamJob:  map 0%  reduce 0%

11/04/23 01:03:31 INFO streaming.StreamJob:  map 50%  reduce 0%

11/04/23 01:03:41 INFO streaming.StreamJob:  map 50%  reduce 17%

11/04/23 01:03:56 INFO streaming.StreamJob:  map 100%  reduce 100%


I've tried playing with various switches including:

-Dhadoop.root.logger=INFO,console
-Dhadoop.log.file=hadoop.log
-Dhadoop.log.dir=$PWD

but none of these make a difference.

Any help would be greatly appreciated!

-- 
- Patrick Donnelly


Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-10 Thread Varun Kapoor
Hey Merto,

Any luck getting the patch running on your cluster?

In case you're interested, there's now a JIRA for this:
https://issues.apache.org/jira/browse/HADOOP-8052.

Varun

On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor rez...@hortonworks.com wrote:

 Your general procedure sounds correct (i.e. dropping your newly built .jar
 into $HD_HOME/lib/), but to make sure it's getting picked up, you should
 explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH environment
 variable; here's mine, as an example:

 export HADOOP_CLASSPATH=.:./build/*.jar

 About your second point, you certainly need to copy this newly patched
 .jar to every node in your cluster, because my patch changes the value of a
 couple metrics emitted TO gmetad (FROM all the nodes in the cluster), so
 without copying it over to every node in the cluster, gmetad will still
 likely receive some bad metrics.

 Varun


 On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek masmer...@gmail.com wrote:

 I will need your help. Please confirm if the following procedure is right.
 I have a dev environment where I pimp my scheduler (no hadoop running) and
 a small cluster environment where the changes (jars) are deployed with some
 scripts; however, I have never compiled the whole of Hadoop from source, so I
 do not know if I am doing it right. I've done it as follows:

 a) apply a patch
 b) cd $HD_HOME; ant
 c) copy $HD_HOME/*build*/patched-core-hadoop.jar ->
 cluster:/$HD_HOME/*lib*
 d) run $HD_HOME/bin/start-all.sh

 Is this enough? When I tried to test hadoop dfs -ls / I could see that a
 new jar was not loaded and instead a jar from
 $HD_HOME/*share*/hadoop-20.205.0.jar
 was taken..
 Should I copy the entire hadoop folder to all nodes and reconfigure the
 entire cluster for the new build, or is enough if I configure it just on
 the node where gmetad will run?






 On 8 February 2012 06:33, Varun Kapoor rez...@hortonworks.com wrote:

  I'm so sorry, Merto - like a silly goose, I attached the 2 patches to my
  reply, and of course the mailing list did not accept the attachment.
 
  I plan on opening JIRAs for this tomorrow, but till then, here are
 links to
  the 2 patches (from my Dropbox account):
 
- http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch
- http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch
 
  Here's hoping this works for you,
 
  Varun
  On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek masmer...@gmail.com
 wrote:
 
   Varun, have I missed your link to the patches? I have tried to search
  them
   on jira but I did not find them.. Can you repost the link for these
 two
   patches?
  
   Thank you..
  
   On 7 February 2012 20:36, Varun Kapoor rez...@hortonworks.com
 wrote:
  
I'm sorry to hear that gmetad cores continuously for you guys. Since
  I'm
not seeing that behavior, I'm going to just put out the 2 possible
   patches
you could apply and wait to hear back from you. :)
   
Option 1
   
* Apply gmetadBufferOverflow.Hadoop.patch to the relevant file (
   
  
 
  http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markup in my setup)
 in your Hadoop sources and rebuild Hadoop.
   
Option 2
   
* Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c
 and
rebuild gmetad.
   
Only 1 of these 2 fixes is required, and it would help me if you
 could
first try Option 1 and let me know if that fixes things for you.
   
Varun
   
On Mon, Feb 6, 2012 at 10:36 PM, mete efk...@gmail.com wrote:
   
Same with Merto's situation here, it always overflows short time
 after
   the
restart. Without the hadoop metrics enabled everything is smooth.
Regards
   
Mete
   
On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek masmer...@gmail.com
   wrote:
   
 I have tried to run it but it repeats crashing..

  - When you start gmetad and Hadoop is not emitting metrics,
   everything
is peachy.
 

 Right, running just ganglia without running hadoop jobs seems
 stable
for at
 least a day..


- When you start Hadoop (and it thus starts emitting
 metrics),
gmetad
cores.
 

 True, with a  following error : *** stack smashing detected ***:
   gmetad
 terminated \n Segmentation fault

 - On my MacBookPro, it's a SIGABRT due to a buffer overflow.
 
  I believe this is happening for everyone. What I would like for
  you
   to
 try
  out are the following 2 scenarios:
 
- Once gmetad cores, if you start it up again, does it core
  again?
Does
this process repeat ad infinitum?
 
 - On my MBP, the core is a one-time thing, and restarting
 gmetad
   after the first core makes things run perfectly smoothly.
  - I know others are saying this core occurs
 continuously,
   but
 they
  were all using ganglia-3.1.x, and I'm 

Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Rob Stewart
Harsh...

Oddly, this blog post has appeared within the last hour or so

http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html

--
Rob

On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote:
 Hello again,

 On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The k,v to the map is null,foo bar lambda
 beta. The canonical Hadoop program would tokenize this line of text
 and output foo,1 and so on. How would the multithreadedmapper know
 how to further divide this line of text into, say: [null,foo
 bar,null,lambda beta] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

 In MultithreadedMapper, the IO work is still single threaded, while
 the map() calling post-read is multithreaded. But yes you could use a
 mix of CombineFileInputFormat and some custom logic to have multiple
 local splits per map task, and divide readers of them among your
 threads. But why do all this when thats what slots at the TT are for?
 The cost of a single map task failure with your mammoth task approach
 would also be higher - more work to repeat.

 Are you saying here that 4 single-threaded OS processes can achieve a
 higher rate of OS IO, than 4 threads within one OS process doing IO
 (which would sound sensible if that's the case).

 Yeah thats what I meant, but with the earlier point of In
 MultithreadedMapper, the IO work is still single threaded
 specifically in mind.

 The argument against this approach is that the cost starting up OS
 processes is far more expensive that forking threads within processes.
 So I would have said the contrary - where map tasks are small and
 input size is large, than many JVMs would be instantiated throughout
 the system, one per task. Instead, one might speculate that reducing
 the number of JVMs, replacing with lower latency thread forking would
 improve runtime speeds. ?

 Agreed here.

 The JVM startup overhead does exist but I wouldn't think its too high
 a cost overall, given the simple benefits it can provide instead.
 There is also JVM reuse which makes sense to use for CPU intensive
 applications, so you can take advantage of the HotSpot features of the
 JVM as it gets reused for running tasks of the same job.

 OK, so are you saying:
 - For CPU intensive tasks, multiple threads might help
 - For IO intensive tasks, multiple OS processes achieve higher
 throughput than multiple threads within a smaller number of OS
 processes?

 Yep, but also if you limit your total slots to 1 in favor of going all
 for multi-threading, you won't be able to smoothly run multiple jobs
 at the same time. Tasks from new jobs may have to wait longer to run,
 while in regular slotted environments this is easier to achieve.

 --
 Harsh J
 Customer Ops. Engineer
 Cloudera | http://tiny.cloudera.com/about


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Rob Stewart
Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote:
 Hello again,

 On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The k,v to the map is null,foo bar lambda
 beta. The canonical Hadoop program would tokenize this line of text
 and output foo,1 and so on. How would the multithreadedmapper know
 how to further divide this line of text into, say: [null,foo
 bar,null,lambda beta] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

 In MultithreadedMapper, the IO work is still single threaded, while
 the map() calling post-read is multithreaded. But yes you could use a
 mix of CombineFileInputFormat and some custom logic to have multiple
 local splits per map task, and divide readers of them among your
 threads. But why do all this when thats what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the
input value into chunks, one chunk for each thread. There is only one
example in the Hadoop 0.23 trunk:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per
map task at all. Again, going back to the word count example: given a
line of text as input to a map which comprises 6 words, I
specify .setNumberOfThreads( 2 ), so ideally I'd want 3 words
analysed by one thread and the other 3 by the other. Is that what would
happen? i.e. - I'm unsure whether the MultithreadedMapper class does
the splitting of inputs to map tasks...

Regards,


Fwd: HELP - Problem in setting up Hadoop - Multi-Node Cluster

2012-02-10 Thread Guruprasad B
Dear Robin,

Thanks for your valuable time and response. please find the attached
namenode logs and configurations files.

I am using 2 Ubuntu boxes, one as master & slave and the other as slave.
below given is the environment set-up in both the machines.

:
Hadoop : hadoop_0.20.2
Linux: Ubuntu Linux 10.10(master) and Ubuntu Linux 11.04(Slave)
Java: java-7-oracle
JAVA_HOME and HADOOP_HOME configuration is done in .bashrc file.

Both the machines are on the LAN and able to ping each other. The IP addresses
of both machines are configured in /etc/hosts.

I do have SSH access to both master and slave as well.

please let me know if you need any other information.

Thanks in advance.

Regards,
Guruprasad






On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady 
robin.mueller-b...@oracle.com wrote:

  Dear Guruprasad,

 it would be very helpful to provide details from your configuration files
 as well as more details on your setup.
 It seems to be that the connection from slave to master cannot be
 established (Connection reset by peer).
 Do you use a virtual environment, physical master/slaves or all on one
 machine ?
 Please paste also the output of kingul2 namenode logs.

 Regards,

 Robin


 On 02/08/12 13:06, Guruprasad B wrote:

 Hi,

 I am Guruprasad from Bangalore (India). I need help in setting up hadoop
 platform. I am very much new to Hadoop Platform.

 I am following the below given articles and I was able to set up
 Single-Node Cluster
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do

 Now I am trying to set up a Multi-Node Cluster by following the below given
 article: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

 Below given is my setup:
 Hadoop : hadoop_0.20.2
 Linux: Ubuntu Linux 10.10
 Java: java-7-oracle


 I have successfully reached the topic "Starting the multi-node cluster" in
 the above given article.
 When I start the HDFS/MapReduce daemons, they start and then go down
 immediately on both master & slave;
 please have a look at the logs below:

 hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh
 starting namenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out
 master: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out
 slave: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out
 master: starting secondarynamenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out

 hduser@kinigul2:/usr/local/hadoop$ jps
 6098 DataNode
 6328 Jps
 5914 NameNode
 6276 SecondaryNameNode

 hduser@kinigul2:/usr/local/hadoop$ jps
 6350 Jps


 I am getting below given error in slave logs:

 2012-02-08 21:04:01,641 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call
 to master/16.150.98.62:54310 failed on local exception:
 java.io.IOException: Connection reset by peer
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
 at org.apache.hadoop.ipc.Client.call(Client.java:743)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 Caused by: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
 at sun.nio.ch.IOUtil.read(IOUtil.java:191)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
 at
 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
 at java.io.FilterInputStream.read(FilterInputStream.java:133)
 at
 

Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread bejoy . hadoop
Hi Rob
   I'm the culprit who posted the blog. :) The topic was of interest to me as
well, and I found the conversation informative and useful. I just thought of
documenting it, as it could be useful for others in the future. Hope
you don't mind!

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Rob Stewart robstewar...@gmail.com
Date: Fri, 10 Feb 2012 18:30:53 
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Harsh...

Oddly, this blog post has appeared within the last hour or so

http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html

--
Rob

On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote:
 Hello again,

 On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The k,v to the map is null,foo bar lambda
 beta. The canonical Hadoop program would tokenize this line of text
 and output foo,1 and so on. How would the multithreadedmapper know
 how to further divide this line of text into, say: [null,foo
 bar,null,lambda beta] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

 In MultithreadedMapper, the IO work is still single threaded, while
 the map() calling post-read is multithreaded. But yes you could use a
 mix of CombineFileInputFormat and some custom logic to have multiple
 local splits per map task, and divide readers of them among your
 threads. But why do all this when thats what slots at the TT are for?
 The cost of a single map task failure with your mammoth task approach
 would also be higher - more work to repeat.

 Are you saying here that 4 single-threaded OS processes can achieve a
 higher rate of OS IO, than 4 threads within one OS process doing IO
 (which would sound sensible if that's the case).

 Yeah thats what I meant, but with the earlier point of In
 MultithreadedMapper, the IO work is still single threaded
 specifically in mind.

 The argument against this approach is that the cost starting up OS
 processes is far more expensive that forking threads within processes.
 So I would have said the contrary - where map tasks are small and
 input size is large, than many JVMs would be instantiated throughout
 the system, one per task. Instead, one might speculate that reducing
 the number of JVMs, replacing with lower latency thread forking would
 improve runtime speeds. ?

 Agreed here.

 The JVM startup overhead does exist but I wouldn't think its too high
 a cost overall, given the simple benefits it can provide instead.
 There is also JVM reuse which makes sense to use for CPU intensive
 applications, so you can take advantage of the HotSpot features of the
 JVM as it gets reused for running tasks of the same job.

 OK, so are you saying:
 - For CPU intensive tasks, multiple threads might help
 - For IO intensive tasks, multiple OS processes achieve higher
 throughput than multiple threads within a smaller number of OS
 processes?

 Yep, but also if you limit your total slots to 1 in favor of going all
 for multi-threading, you won't be able to smoothly run multiple jobs
 at the same time. Tasks from new jobs may have to wait longer to run,
 while in regular slotted environments this is easier to achieve.

 --
 Harsh J
 Customer Ops. Engineer
 Cloudera | http://tiny.cloudera.com/about


Re: HELP - Problem in setting up Hadoop - Multi-Node Cluster

2012-02-10 Thread Guruprasad B
Dear Robin,

Yes, it is possible.

Regards,
Guru

On Fri, Feb 10, 2012 at 1:23 PM, Robin Mueller-Bady 
robin.mueller-b...@oracle.com wrote:

  Dear Guruprasad,

 is it possible to ping both machines with their hostnames ? (ping master /
 ping slave) ?

 Regards,

 Robin

 On 10.02.2012 07:46, Guruprasad B wrote:

 Dear Robin,

 Thanks for your valuable time and response. please find the attached
 namenode logs and configurations files.

 I am using 2 ubuntu boxes.One as master  slave and other as slave.
 below given is the environment set-up in both the machines.

 :
 Hadoop : hadoop_0.20.2
 Linux: Ubuntu Linux 10.10(master) and Ubuntu Linux 11.04(Slave)
 Java: java-7-oracle
 JAVA_HOME and HADOOP_HOME configuration is done in .bashrc file.

 Both the machines are in LAN and able to ping each other. IP address's of 
 both the machines are configured in /etc/hosts.

 I do have SSH access to both master and slave as well.

 please let me know if you need any other information.

 Thanks in advance.

 Regards,
 Guruprasad







  On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady 
 robin.mueller-b...@oracle.com wrote:

  Dear Guruprasad,

 it would be very helpful to provide details from your configuration files
 as well as more details on your setup.
 It seems to be that the connection from slave to master cannot be
 established (Connection reset by peer).
 Do you use a virtual environment, physical master/slaves or all on one
 machine ?
 Please paste also the output of kingul2 namenode logs.

 Regards,

 Robin


 On 02/08/12 13:06, Guruprasad B wrote:

 Hi,

 I am Guruprasad from Bangalore (India). I need help in setting up hadoop
 platform. I am very much new to Hadoop Platform.

 I am following the below given articles and I was able to set up
 Single-Node Cluster
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do

 Now I am trying to set up a Multi-Node Cluster by following the below given
 article: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

 Below given is my setup:
 Hadoop : hadoop_0.20.2
 Linux: Ubuntu Linux 10.10
 Java: java-7-oracle


 I have successfully reached till the topic Starting the multi-node
 cluster in the above given article.
 When I start the HDFS/MapReduce daemons it is getting started and going
 down immediately both in master  slave as well,
 please have a look at the below logs,

 hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh
 starting namenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out
 master: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out
 slave: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out
 master: starting secondarynamenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out

 hduser@kinigul2:/usr/local/hadoop$ jps
 6098 DataNode
 6328 Jps
 5914 NameNode
 6276 SecondaryNameNode

 hduser@kinigul2:/usr/local/hadoop$ jps
 6350 Jps


 I am getting below given error in slave logs:

 2012-02-08 21:04:01,641 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call
 to master/16.150.98.62:54310 failed on local exception:
 java.io.IOException: Connection reset by peer
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
 at org.apache.hadoop.ipc.Client.call(Client.java:743)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 Caused by: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
 at sun.nio.ch.IOUtil.read(IOUtil.java:191)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
 at
 

Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread bejoy . hadoop
Hi Rob
   I'll try to answer this. From my understanding, if you are using
MultithreadedMapper on the word count example with TextInputFormat, and you
have 2 threads and 2 lines in your input split, the RecordReader would read
line 1 and give it to map thread 1, and line 2 to map thread 2. So an
essentially identical map() process would happen on these two lines in
parallel. This would be the default behavior.
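
To illustrate, here is a minimal sketch of the delegate mapper you would hand
to MultithreadedMapper.setMapperClass() for word count (the class name is just
an example); it is an ordinary Mapper, and the line-by-line dispatch to
threads described above is done by the framework, not by this class:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each map() call here may run on a different MultithreadedMapper worker
    // thread, so avoid shared mutable state across invocations.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      context.write(new Text(itr.nextToken()), ONE);
    }
  }
}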
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Rob Stewart robstewar...@gmail.com
Date: Fri, 10 Feb 2012 18:39:44 
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote:
 Hello again,

 On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The k,v to the map is null,foo bar lambda
 beta. The canonical Hadoop program would tokenize this line of text
 and output foo,1 and so on. How would the multithreadedmapper know
 how to further divide this line of text into, say: [null,foo
 bar,null,lambda beta] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

 In MultithreadedMapper, the IO work is still single threaded, while
 the map() calling post-read is multithreaded. But yes you could use a
 mix of CombineFileInputFormat and some custom logic to have multiple
 local splits per map task, and divide readers of them among your
 threads. But why do all this when thats what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the
input value into chunks, one chunk for each thread. There is only one
example in the Hadoop 0.23 trunk that offers an example:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per
map task at all. Again, going back to the word count example. Given a
line of text as input to a map, which comprises of 6 words. I
specificy .setNumberOfThreads( 2 ), so ideally, I'd want 3 words
analysed by one thread, and the 3 to the other. Is what what would
happen? i.e. - I'm unsure whether the multithreadedmapper class does
the splitting of inputs to map tasks...

Regards,


Where Is DataJoinMapperBase?

2012-02-10 Thread Bing Li
Hi, all,

I am starting to learn advanced Map/Reduce. However, I cannot find the
class DataJoinMapperBase in my downloaded Hadoop 1.0.0 and 0.20.2. So I
searched the Web and got the following link.

 http://www.java2s.com/Code/Jar/h/Downloadhadoop0201datajoinjar.htm

From the link I got the package, hadoop-0.20.1-datajoin.jar. My question is
why the package is not included in Hadoop 1.0.0 and 0.20.2? Is this the correct
way to get it?

Thanks so much!

Best regards,
Bing


Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

2012-02-10 Thread Raj Vishwanathan



Here is what I understand 

The RecordReader for the MTMapper takes the input split and cycles the records 
among the available threads. It also ensures that the map outputs are 
synchronized. 

So what Bejoy says is what will happen for the wordcount program. 

Raj




 From: bejoy.had...@gmail.com bejoy.had...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Friday, February 10, 2012 11:15 AM
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
 
Hi Rob
       I'd try to answer this. From my understanding if you are using 
Multithreaded mapper on word count example with TextInputFormat and imagine 
you have 2 threads and 2 lines in your input split . RecordReader would read 
Line 1 and give it to map thread 1 and line 2 to map thread 2. So kind of 
identical process as defined would be happening with these two lines in 
parallel. This would be the default behavior.
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Rob Stewart robstewar...@gmail.com
Date: Fri, 10 Feb 2012 18:39:44 
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote:
 Hello again,

 On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote:
 OK, take word count. The k,v to the map is null,foo bar lambda
 beta. The canonical Hadoop program would tokenize this line of text
 and output foo,1 and so on. How would the multithreadedmapper know
 how to further divide this line of text into, say: [null,foo
 bar,null,lambda beta] for 2 threads to run in parallel? Can you
 somehow provide an additional record reader to split the input to the
 map task into sub-inputs for each thread?

 In MultithreadedMapper, the IO work is still single threaded, while
 the map() calling post-read is multithreaded. But yes you could use a
 mix of CombineFileInputFormat and some custom logic to have multiple
 local splits per map task, and divide readers of them among your
 threads. But why do all this when thats what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the
input value into chunks, one chunk for each thread. There is only one
example in the Hadoop 0.23 trunk that offers an example:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per
map task at all. Again, going back to the word count example. Given a
line of text as input to a map, which comprises of 6 words. I
specificy .setNumberOfThreads( 2 ), so ideally, I'd want 3 words
analysed by one thread, and the 3 to the other. Is what what would
happen? i.e. - I'm unsure whether the multithreadedmapper class does
the splitting of inputs to map tasks...

Regards,




Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-10 Thread Merto Mertek
Varun, unfortunately I have had some problems with deploying the new version
on the cluster. Hadoop is not picking up the new build in the lib folder even
though the classpath is set to it. The new build is picked up only if I put it
in $HD_HOME/share/hadoop/, which is very strange. I've done this on all nodes
and can access the web UI, but all the tasktrackers are being stopped because
of an error:

INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...
 java.lang.InterruptedException: sleep interrupted
 at java.lang.Thread.sleep(Native Method)
 at
 org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)



Probably the error is the consequence of an inadequate deploy of a jar. I
will ask the dev list how they do it, or do you maybe have any other
idea?



On 10 February 2012 17:10, Varun Kapoor rez...@hortonworks.com wrote:

 Hey Merto,

 Any luck getting the patch running on your cluster?

 In case you're interested, there's now a JIRA for this:
 https://issues.apache.org/jira/browse/HADOOP-8052.

 Varun

 On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor rez...@hortonworks.com
 wrote:

  Your general procedure sounds correct (i.e. dropping your newly built
 .jar
  into $HD_HOME/lib/), but to make sure it's getting picked up, you should
  explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH
 environment
  variable; here's mine, as an example:
 
  export HADOOP_CLASSPATH=.:./build/*.jar
 
  About your second point, you certainly need to copy this newly patched
  .jar to every node in your cluster, because my patch changes the value
 of a
  couple metrics emitted TO gmetad (FROM all the nodes in the cluster), so
  without copying it over to every node in the cluster, gmetad will still
  likely receive some bad metrics.
 
  Varun
 
 
  On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek masmer...@gmail.com
 wrote:
 
  I will need your help. Please confirm if the following procedure is
 right.
  I have a dev environment where I pimp my scheduler (no hadoop running)
 and
  a small cluster environment where the changes(jars) are deployed with
 some
  scripts,  however I have never compiled the whole hadoop from source so
 I
  do not know if I am doing it right. I' ve done it as follow:
 
  a) apply a patch
  b) cd $HD_HOME; ant
  c) copy $HD_HOME/*build*/patched-core-hadoop.jar -
  cluster:/$HD_HOME/*lib*
  d) run $HD_HOME/bin/start-all.sh
 
  Is this enough? When I tried to test hadoop dfs -ls / I could see
 that a
  new jar was not loaded and instead a jar from
  $HD_HOME/*share*/hadoop-20.205.0.jar
  was taken..
  Should I copy the entire hadoop folder to all nodes and reconfigure the
  entire cluster for the new build, or is enough if I configure it just on
  the node where gmetad will run?
 
 
 
 
 
 
  On 8 February 2012 06:33, Varun Kapoor rez...@hortonworks.com wrote:
 
   I'm so sorry, Merto - like a silly goose, I attached the 2 patches to
 my
   reply, and of course the mailing list did not accept the attachment.
  
   I plan on opening JIRAs for this tomorrow, but till then, here are
  links to
   the 2 patches (from my Dropbox account):
  
 - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch
 - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch
  
   Here's hoping this works for you,
  
   Varun
   On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek masmer...@gmail.com
  wrote:
  
Varun, have I missed your link to the patches? I have tried to
 search
   them
on jira but I did not find them.. Can you repost the link for these
  two
patches?
   
Thank you..
   
On 7 February 2012 20:36, Varun Kapoor rez...@hortonworks.com
  wrote:
   
 I'm sorry to hear that gmetad cores continuously for you guys.
 Since
   I'm
 not seeing that behavior, I'm going to just put out the 2 possible
patches
 you could apply and wait to hear back from you. :)

 Option 1

 * Apply gmetadBufferOverflow.Hadoop.patch to the relevant file (

   
  
 
  http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markup in my setup)
  in your Hadoop sources and rebuild Hadoop.

 Option 2

 * Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c
  and
 rebuild gmetad.

 Only 1 of these 2 fixes is required, and it would help me if you
  could
 first try Option 1 and let me know if that fixes things for you.

 Varun

 On Mon, Feb 6, 2012 at 10:36 PM, mete efk...@gmail.com wrote:

 Same with Merto's situation here, it always overflows short time
  after
the
 restart. Without the hadoop metrics enabled everything is smooth.
 Regards

 Mete

 On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek 
 masmer...@gmail.com
wrote:

  I have tried to run it but it repeats crashing..
 
   - When you start gmetad and 

Re: HELP - Problem in setting up Hadoop - Multi-Node Cluster

2012-02-10 Thread anil gupta
Hi,

Is your datanode initially able to connect to the Namenode? Have you disabled
all the firewall-related services? Do you see any errors in the startup
logs of the Namenode or Datanode?

I have dealt with a similar kind of problem earlier.
So here is what you can try to do:
First, test that ssh is working fine to ensure the network is working fine. Ssh
into the slave from the master and ssh into the master from the same slave.
Leave the ssh session open for as long as you can.
In my case, when I did the above experiment the ssh session kept dropping, so
I got to know that it was a network-related problem. It had nothing to do
with Hadoop.

This post might be helpful for you:
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/4165f39d8b0bbc56

Best Regards,
Anil

On Fri, Feb 10, 2012 at 1:43 AM, Guruprasad B guruprasadk...@gmail.comwrote:

 Dear Robin,

 Yes, it is possible.

 Regards,
 Guru

 On Fri, Feb 10, 2012 at 1:23 PM, Robin Mueller-Bady 
 robin.mueller-b...@oracle.com wrote:

  Dear Guruprasad,

 is it possible to ping both machines with their hostnames ? (ping master
 / ping slave) ?

 Regards,

 Robin


 On 10.02.2012 07:46, Guruprasad B wrote:

 Dear Robin,

 Thanks for your valuable time and response. please find the attached
 namenode logs and configurations files.

 I am using 2 ubuntu boxes.One as master  slave and other as slave.
 below given is the environment set-up in both the machines.

 :
 Hadoop : hadoop_0.20.2
 Linux: Ubuntu Linux 10.10(master) and Ubuntu Linux 11.04(Slave)
 Java: java-7-oracle
 JAVA_HOME and HADOOP_HOME configuration is done in .bashrc file.

 Both the machines are in LAN and able to ping each other. IP address's of 
 both the machines are configured in /etc/hosts.

 I do have SSH access to both master and slave as well.

 please let me know if you need any other information.

 Thanks in advance.

 Regards,
 Guruprasad







  On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady 
 robin.mueller-b...@oracle.com wrote:

  Dear Guruprasad,

 it would be very helpful to provide details from your configuration
 files as well as more details on your setup.
 It seems to be that the connection from slave to master cannot be
 established (Connection reset by peer).
 Do you use a virtual environment, physical master/slaves or all on one
 machine ?
 Please paste also the output of kingul2 namenode logs.

 Regards,

 Robin


 On 02/08/12 13:06, Guruprasad B wrote:

 Hi,

 I am Guruprasad from Bangalore (India). I need help in setting up hadoop
 platform. I am very much new to Hadoop Platform.

 I am following the below given articles and I was able to set up
 Single-Node Cluster
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do

 Now I am trying to set up a Multi-Node Cluster by following the below given
 article: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

 Below given is my setup:
 Hadoop : hadoop_0.20.2
 Linux: Ubuntu Linux 10.10
 Java: java-7-oracle


 I have successfully reached till the topic Starting the multi-node
 cluster in the above given article.
 When I start the HDFS/MapReduce daemons it is getting started and going
 down immediately both in master  slave as well,
 please have a look at the below logs,

 hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh
 starting namenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out
 master: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out
 slave: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out
 master: starting secondarynamenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out

 hduser@kinigul2:/usr/local/hadoop$ jps
 6098 DataNode
 6328 Jps
 5914 NameNode
 6276 SecondaryNameNode

 hduser@kinigul2:/usr/local/hadoop$ jps
 6350 Jps


 I am getting below given error in slave logs:

 2012-02-08 21:04:01,641 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call
 to master/16.150.98.62:54310 failed on local exception:
 java.io.IOException: Connection reset by peer
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
 at org.apache.hadoop.ipc.Client.call(Client.java:743)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
 at
 

is 1.0.0 stable?

2012-02-10 Thread Stan Kaushanskiy

Hi everyone,

I would imagine that 1.0.0 is stable, but the stable link still takes
one to the 0.20.203 release. Is 1.0.0 ready for production usage? If
not, what about 0.20.205?


thanks,

stan


Re: reference document which properties are set in which configuration file

2012-02-10 Thread Harsh J
As a thumb rule, all properties starting with mapred.* or mapreduce.*
go to mapred-site.xml, all properties starting with dfs.* go to
hdfs-site.xml, and the rest may be put in core-site.xml to be safe.

In case you notice MR or HDFS specific properties being outside of
this naming convention, please do report a JIRA so we can deprecate
the old name and rename it with a more appropriate prefix.
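
If you want to double-check where a given property ends up, a small sketch
like the following (the config-directory paths are assumptions for a typical
install) loads the usual resources and prints the effective value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfCheck {
  public static void main(String[] args) {
    // new Configuration() picks up core-default.xml and core-site.xml from the
    // classpath; the explicit paths below are assumptions for a typical install.
    Configuration conf = new Configuration();
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
    String key = args.length > 0 ? args[0] : "mapred.task.tracker.http.address";
    System.out.println(key + " = " + conf.get(key));
  }
}

Running it with a property name shows the value that resolves after the site
files are applied, which helps spot a property that ended up in the wrong file
or got no value at all.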

On Sat, Feb 11, 2012 at 9:27 AM, Praveen Sripati
praveensrip...@gmail.com wrote:
 The mapred.task.tracker.http.address will go in the mapred-site.xml file.

 In the Hadoop installation directory check the core-default.xml,
 hdfs-default.xml and mapred-default.xml files to know about the different
 properties. Some of the properties which might be in the code may not be
 mentioned in the xml files and will be defaulted.

 Praveen

 On Tue, Feb 7, 2012 at 3:30 PM, Kleegrewe, Christian 
 christian.kleegr...@siemens.com wrote:

 Dear all,

 while configuring our hadoop cluster I wonder whether there exists a
 reference document that contains information about which configuration
 property has to be specified in which properties file. Especially I do not
 know where the mapred.task.tracker.http.address has to be set. Is it in the
 mapred-site.xml or in the hdfs-site.xml?

 any hint will be appreciated

 thanks

 Christian


 8--
 Siemens AG
 Corporate Technology
 Corporate Research and Technologies
 CT T DE IT3
 Otto-Hahn-Ring 6
 81739 München, Deutschland
 Tel.: +49 89 636-42722
 Fax: +49 89 636-41423
 mailto:christian.kleegr...@siemens.com

 Siemens Aktiengesellschaft: Vorsitzender des Aufsichtsrats: Gerhard
 Cromme; Vorstand: Peter Löscher, Vorsitzender; Roland Busch, Brigitte
 Ederer, Klaus Helmrich, Joe Kaeser, Barbara Kux, Hermann Requardt,
 Siegfried Russwurm, Peter Y. Solmssen, Michael Süß; Sitz der Gesellschaft:
 Berlin und München, Deutschland; Registergericht: Berlin Charlottenburg,
 HRB 12300, München, HRB 6684; WEEE-Reg.-Nr. DE 23691322






-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Re: reference document which properties are set in which configuration file

2012-02-10 Thread Raj Vishwanathan
Harsh, All

This was one of the first questions that I asked. It is sometimes not clear
whether some parameters are site related or job related, or whether they belong
to the NN, JT, DN or TT.

If I get some time during the weekend, I will try and put this into a document
and see if it helps.

Raj




 From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org 
Sent: Friday, February 10, 2012 8:31 PM
Subject: Re: reference document which properties are set in which 
configuration file
 
As a thumb rule, all properties starting with mapred.* or mapreduce.*
go to mapred-site.xml, all properties starting with dfs.* go to
hdfs-site.xml, and the rest may be put in core-site.xml to be safe.

In case you notice MR or HDFS specific properties being outside of
this naming convention, please do report a JIRA so we can deprecate
the old name and rename it with a more appropriate prefix.

On Sat, Feb 11, 2012 at 9:27 AM, Praveen Sripati
praveensrip...@gmail.com wrote:
 The mapred.task.tracker.http.address will go in the mapred-site.xml file.

 In the Hadoop installation directory check the core-default.xml,
 hdfs-default,xml and mapred-default.xml files to know about the different
 properties. Some of the properties which might be in the code may not be
 mentioned in the xml files and will be defaulted.

 Praveen

 On Tue, Feb 7, 2012 at 3:30 PM, Kleegrewe, Christian 
 christian.kleegr...@siemens.com wrote:

 Dear all,

 while configuring our hadoop cluster I wonder whether there exists a
 reference document that contains information about which configuration
 property has to be specified in which properties file. Especially I do not
 know where the mapred.task.tracker.http.address has to be set. Is it in the
 mapre-site.xml or in the hdfs-site.xml?

 any hint will be appreciated

 thanks

 Christian


 8--
 Siemens AG
 Corporate Technology
 Corporate Research and Technologies
 CT T DE IT3
 Otto-Hahn-Ring 6
 81739 München, Deutschland
 Tel.: +49 89 636-42722
 Fax: +49 89 636-41423
 mailto:christian.kleegr...@siemens.com

 Siemens Aktiengesellschaft: Vorsitzender des Aufsichtsrats: Gerhard
 Cromme; Vorstand: Peter Löscher, Vorsitzender; Roland Busch, Brigitte
 Ederer, Klaus Helmrich, Joe Kaeser, Barbara Kux, Hermann Requardt,
 Siegfried Russwurm, Peter Y. Solmssen, Michael Süß; Sitz der Gesellschaft:
 Berlin und München, Deutschland; Registergericht: Berlin Charlottenburg,
 HRB 12300, München, HRB 6684; WEEE-Reg.-Nr. DE 23691322






-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about