Re: A couple of Questions on InputFormat
Hi, (I'm assuming 1.0~ MR here) On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis lordjoe2...@gmail.com wrote: Classes implementing InputFormat implement public List<InputSplit> getSplits(JobContext job), which returns a List of InputSplits. For FileInputFormat the splits have a Path, start and end. 1) When is this method called, on which JVM on which machine, and is it called only once? Called only at the client, i.e. your "hadoop jar" JVM. Called only once. 2) Does the number of map tasks correspond to the number of splits returned by getSplits? Yes, number of split objects == number of mappers. 3) InputFormat implements a method RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context). Is this executed within the JVM of the Mapper on the slave machine, and does the RecordReader run within that JVM? RecordReaders are not created on the client-side JVM. RecordReaders are created on the map task JVMs, and run inside them. 4) The default RecordReaders read a file from the start position to the end position, emitting values in the order read. With such a reader, assuming it is reading lines of text, is it reasonable to assume that the values the mapper receives are in the same order they were found in the file? Would it, for example, be possible for WordCount to see a word that was hyphenated at the end of one line and append the first word of the next line it sees (ignoring the case where the word is at the end of a split)? If you speak of the LineRecordReader, each map() will simply read a line, i.e. until \n. It is not language-aware enough to understand the meaning of hyphens, etc. You can implement a custom reader to do this, however - there should be no problems so long as your logic covers the case of not having any duplicate reads across multiple maps. -- Harsh J
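For readers following this thread, here is a minimal sketch of the kind of custom reader Harsh mentions, wrapping LineRecordReader so that a line ending in a hyphen is joined with the following line. The class name is made up and the end-of-split case is deliberately not handled:

// Hypothetical sketch (not from the thread): wraps LineRecordReader and joins
// a line ending in '-' with the following line(s) from the same split.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class HyphenJoiningRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader lines = new LineRecordReader();
  private LongWritable key;
  private Text value;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    lines.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!lines.nextKeyValue()) {
      return false;
    }
    key = lines.getCurrentKey();
    StringBuilder joined = new StringBuilder(lines.getCurrentValue().toString());
    // While the accumulated line ends with a hyphen, pull in the next line.
    // Note: a hyphenated word at the very end of a split is not handled here.
    while (joined.length() > 0 && joined.charAt(joined.length() - 1) == '-'
        && lines.nextKeyValue()) {
      joined.deleteCharAt(joined.length() - 1);
      joined.append(lines.getCurrentValue().toString());
    }
    value = new Text(joined.toString());
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() throws IOException { return lines.getProgress(); }
  @Override public void close() throws IOException { lines.close(); }
}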
Re: A couple of Questions on InputFormat
Thank you for your thorough answer. The last question is essentially this: while I can write a custom input format to handle things like hyphens, I could do almost the same thing in the mapper by saving any hyphenated words from the last line (ignoring hyphenated words that cross a split boundary), as long as LineRecordReader guarantees that each line in the split is sent to the same mapper in the order read. This seems to be the case - right? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
Re: A couple of Questions on InputFormat
Hi, Yes, that is right. On Mon, Sep 23, 2013 at 9:04 PM, Steve Lewis lordjoe2...@gmail.com wrote: Thank you for your thorough answer. The last question is essentially this: while I can write a custom input format to handle things like hyphens, I could do almost the same thing in the mapper by saving any hyphenated words from the last line (ignoring hyphenated words that cross a split boundary), as long as LineRecordReader guarantees that each line in the split is sent to the same mapper in the order read. This seems to be the case - right? -- Harsh J
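A minimal mapper-side sketch of the alternative Steve describes, assuming the in-order delivery Harsh confirms above; names are illustrative and a fragment left at the end of a split is simply dropped:

// Hypothetical variant (not from the thread): the mapper carries the hyphenated
// tail of the previous line and prepends it to the next line it receives.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HyphenAwareWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private String carriedFragment = "";  // hyphenated tail of the previous line

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = carriedFragment + value.toString();
    carriedFragment = "";
    if (line.endsWith("-")) {
      int lastSpace = line.lastIndexOf(' ');
      // Carry the partial word (minus the hyphen) into the next map() call.
      carriedFragment = line.substring(lastSpace + 1, line.length() - 1);
      line = line.substring(0, Math.max(lastSpace, 0));
    }
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE);
      }
    }
  }
}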
Re: Lifecycle dates for Apache Hadoop versions
Hi Arun, On Sun, Sep 22, 2013 at 8:04 AM, H G, Arun arun@hp.com wrote: Hi, are there end-of-life / support dates published for Apache Hadoop versions? Please direct me to a link if there is such information. I don't think anybody can answer this question once and for all. Apache Hadoop is an ASF project, not a software product. As such, the longevity of every particular branch of development is only governed by the level of interest that the community at large has in it. For example, if there was somebody interested in continuing development of the Hadoop 0.20.x branch, there would be nothing (short of 3 PMC votes) stopping releases on that branch. That said, one broad observation that may be helpful to you is that most of the community these days seems to be investing in the Hadoop 2.x codeline over anything else. Thanks, Roman.
RE: Problem building hadoop 2 native
The log looks to me like you need to install cmake, or it doesn't appear in PATH. From: Alon Grinshpoon [alo...@mellanox.com] Sent: September 23, 2013 14:49 To: user@hadoop.apache.org Cc: Avner Ben Hanoch; Elad Itzhakian Subject: Problem building hadoop 2 native Hello, I am trying to build hadoop 2.0.5/2.1.0 with native libraries (so compression can be used): mvn package -Pnative -Pdist -DskipTests -Dtar but it fails. The problem occurs during the build of "Hadoop-Common" and the error is: [ERROR] Could not find goal 'protoc' in plugin org.apache.hadoop:hadoop-maven-plugins:2.1.0-beta among available goals - [Help 1] org.apache.maven.plugin.MojoNotFoundException: Could not find goal 'protoc' in plugin org.apache.hadoop:hadoop-maven-plugins:2.1.0-beta among available goals (I'm attaching the full build log in case you want to review it.) Building failed on both Hadoop 2.1.0 and 2.0.5, using protobuf 2.4.1 and 2.5.0. Please help! Thank you
Queues in Fair Scheduler
Hi, We are trying to implement queues in the fair scheduler using mapred ACLs. If I set up queues and try to use them in the fair scheduler, the job fails unless I add the following two properties to the job: -Dmapred.job.queue.name=<queue name> -Dmapreduce.job.acl-view-job=* Is that correct? If yes, is there a way to set a default queue per user, and similarly a better way to set the acl-view property? Thanks, Anurag Tangri
RE: Problem building hadoop 2 native
You are right, it works! Thank you very much! -----Original Message----- From: 谢良 [mailto:xieli...@xiaomi.com] Sent: Monday, September 23, 2013 10:17 AM To: user@hadoop.apache.org Cc: Avner Ben Hanoch; Elad Itzhakian Subject: RE: Problem building hadoop 2 native The log looks to me like you need to install cmake, or it doesn't appear in PATH.
Re: How to best decide mapper output/reducer input for a huge string?
@Rahul, Yes, you are right. 21 mappers are spawned, and all 21 mappers are functional at the same time. Although, @Pradeep, I should do the compression like you say. I'll give it a shot. As far as I can see, I think I'll need to implement Writable and write out the key of the mapper using the specific data types instead of writing it out as a string, which might slow the operation down. On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Pavan, It's hard to tell whether there's anything wrong with your design or not since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only one mapper might be doing the heavy lifting. A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice and you should almost always have one. Compression can be turned on by setting the following properties in your job config:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc. depending on your use case. Gzip is really slow but gets the best compression ratios. Snappy/Lzo are a lot faster but don't have as good a compression ratio. If your computations are CPU bound, then you'd probably want to use Snappy/Lzo. If your computations are I/O bound and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks that you can try to get the best performance out of your cluster. One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor usage of resources while your job is running. This will tell you if your job is I/O bound or CPU bound. Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment: http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm and it might warrant a redesign. Hope this helps! - Pradeep On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: One mapper is spawned per HBase table region. You can use the admin UI of HBase to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done. If that is the case then you can recreate the tables with predefined splits to create more regions. Thanks, Rahul On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote: Pavan, How large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)? John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Saturday, September 21, 2013 2:17 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed. Although there's no real bottleneck, the time it takes to process the entire table is huge. I have to constantly check if I can optimize it somehow. Oh okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of re-work? Thank you so much for helping. On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com wrote: One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get a lot better performance if you write a custom Writable for your key object using the appropriate data types. For example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster. If HouseHoldId is an integer, your speed of comparisons for sorting will also go up. Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner? Have you been able to profile your code to see where the bottlenecks are? On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com wrote: Hi Pradeep, Yes.. Basically i'm only writing the key part as the map output.. The V of K,V is
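A minimal sketch of setting the same map-output compression switches through the Java job configuration rather than XML (the job name is a placeholder; the property names are the ones Pradeep lists above):

// Hypothetical helper that builds a Job with compressed map output.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutputJob {
  public static Job create() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        GzipCodec.class, CompressionCodec.class);
    return Job.getInstance(conf, "compressed-map-output");  // placeholder job name
  }
}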
Re: How to best decide mapper output/reducer input for a huge string?
@John, to be really frank I don't know what the limiting factor is. It might be all of them or a subset of them. I cannot tell. On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra pavan0...@gmail.com wrote: @Rahul, Yes, you are right. 21 mappers are spawned, and all 21 mappers are functional at the same time. Although, @Pradeep, I should do the compression like you say. I'll give it a shot. As far as I can see, I think I'll need to implement Writable and write out the key of the mapper using the specific data types instead of writing it out as a string, which might slow the operation down.
hadoop-2.1.0-beta - Namenode federation Configuration
Guys, Some time back I installed the hadoop-2.1.0-beta version to try out the YARN features. Now I want to configure HDFS federation on this existing cluster setup. Is that possible? Please clarify this and help me find some tutorials for it. Thanks, Manickam P
RE: HDFS federation Configuration
Hi Suresh, I'm not able to follow the page completely. Can you please help me with some clear step-by-step instructions, or a little more detail on the configuration side? I am trying this on 3 machines with the 2.1.0-beta version of Hadoop. My intention is to have 2 federated name nodes and 2 data nodes. Thanks, Manickam P Date: Thu, 19 Sep 2013 12:28:04 -0700 Subject: Re: HDFS federation Configuration From: sur...@hortonworks.com To: user@hadoop.apache.org Have you looked at http://hadoop.apache.org/docs/r2.1.0-beta/hadoop-project-dist/hadoop-hdfs/Federation.html ? Let me know if the document is not clear or needs improvements. Regards, Suresh On Thu, Sep 19, 2013 at 12:01 PM, Manickam P manicka...@outlook.com wrote: Guys, I need some tutorials to configure federation. Can you please suggest some? Thanks, Manickam P
Help running Hadoop 2.0.5 with Snappy compression
Hi, I'm trying to run some MapReduce jobs on the Hadoop 2.0.5 framework using Snappy compression. I built Hadoop with -Pnative, installed it and Snappy on all 3 machines (master + 2 slaves) and copied the .so files as required to $HADOOP_HOME/lib/native. Also I added the following to $HADOOP_CONF_DIR/mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
And this I added to core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec,
    org.apache.hadoop.io.compress.SnappyCodec
  </value>
</property>
I then ran the following jobs: Pi: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar pi 8 2000 Teragen/Terasort: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar teragen 10 /in bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar terasort /in /out But when grepping the logs I can't seem to find any sign of Snappy: [eladi@r-zorro003 hadoop-2.0.5-alpha]$ grep -r Snappy /data1/elad/logs/* [eladi@r-zorro003 hadoop-2.0.5-alpha]$ grep -r compress /data1/elad/logs/* /data1/elad/logs/application_1379926544427_0001/container_1379926544427_0001_01_10/syslog:2013-09-23 11:56:24,182 INFO [fetcher#5] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded initialized native-zlib library /data1/elad/logs/application_1379926544427_0001/container_1379926544427_0001_01_10/syslog:2013-09-23 11:56:24,183 INFO [fetcher#5] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.deflate] /data1/elad/logs/application_1379926544427_0001/container_1379926544427_0001_01_10/syslog:2013-09-23 11:56:24,331 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate] It seems as if Zlib is being loaded instead of Snappy. What am I missing? Thanks, Elad Itzhakian
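One quick check (a hedged suggestion, not from the thread) is to confirm from Java that native Snappy support actually loads on the task nodes; note also that the map-output codec property is spelled mapreduce.map.output.compress.codec elsewhere in this digest, whereas the snippet above uses the older mapred.* spelling.

// Prints whether the Hadoop native library and its Snappy support are loadable
// in this JVM; run it on a worker node with the same lib/native directory.
import org.apache.hadoop.util.NativeCodeLoader;

public class SnappyCheck {
  public static void main(String[] args) {
    boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
    System.out.println("native hadoop loaded: " + nativeLoaded);
    // Only safe to query Snappy support once the native library itself is loaded.
    if (nativeLoaded) {
      System.out.println("build supports snappy: " + NativeCodeLoader.buildSupportsSnappy());
    }
  }
}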
Distributed cache in command line
Hi, Is it possible to access the distributed cache from the command line? I have written a custom InputFormat implementation which I want to add to the distributed cache. Using libjars is not an option for me, as I am not running the Hadoop job from the command line; I am running it using the RHadoop package in R, which internally uses Hadoop streaming. Please help. Thanks. Regards, Anand.C
Error while configuring HDFS federation
Guys, I'm trying to configure HDFS federation with the 2.1.0-beta version. I have 3 machines; I want two name nodes and one data node. I have done the other things like passwordless ssh and host entries properly. When I start the cluster I'm getting the errors below. On node one I'm getting this error: java.net.BindException: Port in use: lab-hadoop.eng.com:50070 On another node I'm getting this error: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/lab/hadoop-2.1.0-beta/tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. My core-site.xml has the below:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.101.89.68:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/lab/hadoop-2.1.0-beta/tmp</value>
  </property>
</configuration>
My hdfs-site.xml has the below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.federation.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>10.101.89.68:9001</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>10.101.89.68:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>10.101.89.68:50090</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>10.101.89.69:9001</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>10.101.89.69:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>10.101.89.69:50090</value>
  </property>
</configuration>
Please help me to fix this error. Thanks, Manickam P
RE: Error while configuring HDFS federation
Ports in use may result from actual processes using them, or just ghost processes. The second error may be caused by inconsistent permissions on different nodes, and/or a format is needed on DFS. I suggest the following:
1. sbin/stop-dfs.sh and sbin/stop-yarn.sh
2. sudo killall java (on all nodes)
3. sudo chmod -R 755 /home/lab/hadoop-2.1.0-beta/tmp/dfs (on all nodes)
4. sudo rm -rf /home/lab/hadoop-2.1.0-beta/tmp/dfs/* (on all nodes)
5. bin/hdfs namenode -format -force
6. sbin/start-dfs.sh and sbin/start-yarn.sh
Then see if you get that error again. From: Manickam P [mailto:manicka...@outlook.com] Sent: Monday, September 23, 2013 4:44 PM To: user@hadoop.apache.org Subject: Error while configuring HDFS federation
RE: Error while configuring HDFS federation
Hi, I followed your steps. The bind error got resolved, but I'm still getting the second exception. I've given the complete stack below. 2013-09-23 10:26:01,887 INFO org.mortbay.log: Stopped selectchannelconnec...@lab2-hadoop2-vm1.eng.com:50070 2013-09-23 10:26:01,988 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system... 2013-09-23 10:26:01,989 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped. 2013-09-23 10:26:01,990 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete. 2013-09-23 10:26:01,991 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/lab/hadoop-2.1.0-beta/tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:292) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:200) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:777) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:558) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:418) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:466) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:659) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:644) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1221) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1287) 2013-09-23 10:26:02,001 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 2013-09-23 10:26:02,018 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: Thanks, Manickam P
RE: Error while configuring HDFS federation
Hi, Thanks for your inputs. I fixed the issue. Thanks, Manickam P From: el...@mellanox.com To: user@hadoop.apache.org Subject: RE: Error while configuring HDFS federation Date: Mon, 23 Sep 2013 14:05:47 +
RE: HDFS federation Configuration
Hi, Thanks. I have done that. Thanks, Manickam P Date: Mon, 23 Sep 2013 10:02:42 -0700 Subject: Re: HDFS federation Configuration From: sur...@hortonworks.com To: user@hadoop.apache.org
RE: HDFS federation Configuration
Hi, I have a couple of doubts while doing this. Please help me understand. How do I generate or find out the cluster id for the step below? I saw that for one node we can format without this cluster id, but we need to pass the same cluster id that got generated to the other name node while formatting, or else we do not get a federated cluster. $HADOOP_PREFIX_HOME/bin/hdfs namenode -format -clusterId <cluster_id> In my case I want to have two name nodes and 3 data nodes. Do I need to mention all the data node IP addresses in the slaves files on both name nodes? Thanks, Manickam P Date: Mon, 23 Sep 2013 10:02:42 -0700 Subject: Re: HDFS federation Configuration From: sur...@hortonworks.com To: user@hadoop.apache.org
RE: How to best decide mapper output/reducer input for a huge string?
You might try creating a "stub" MR job in which the mappers produce no output; that would isolate the time spent reading from HBase without the trouble of instrumenting your code. John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Monday, September 23, 2013 3:31 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? @John, to be really frank I don't know what the limiting factor is. It might be all of them or a subset of them. Cannot tell.
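A minimal sketch of the kind of stub John describes, assuming the job scans an HBase table through the TableMapper API; the class name and counter group are placeholders:

// Hypothetical "stub" mapper: reads every HBase row but emits nothing, so the
// job's runtime roughly isolates the cost of the HBase scan itself.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

public class ScanOnlyMapper extends TableMapper<NullWritable, NullWritable> {
  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // Touch the row via a counter so the read is observable, but emit no output.
    context.getCounter("stub", "rowsRead").increment(1);
  }
}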
Re: How to make hadoop use all nodes?
Hi Omkar, "(which has 40 container slots.)" - for the total cluster? Yes, though that was just a hypothetical value. Below are my real configurations. 1) yarn-site.xml - what is the resource memory configured per node? 12288 MB. 2) yarn-site.xml - what is the minimum resource allocation for the cluster? 1024 MB minimum, 12288 MB maximum. I also have these memory configurations in mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5000</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4g -Djava.awt.headless=true</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>5000</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx4g -Djava.awt.headless=true</value>
</property>
3) yarn-resource-manager-log (while starting resource manager, export YARN_ROOT_LOGGER=DEBUG,RFA); I am looking for debug logs. The resulting log is really verbose. Are you searching for something in particular? 4) On the RM UI, how much total cluster memory is reported (how many total nodes)? (RM UI, click on Cluster.) I have 58 active nodes and the total memory reported is 696 GB, which is 58 x 12 as expected. I have 93 containers running instead of the 116 I would expect (my job has 2046 maps so it could use all 116 containers). Here is a copy-paste of what I have in the scheduler tab: Queue State: RUNNING, Used Capacity: 99.4%, Absolute Capacity: 100.0%, Absolute Max Capacity: 100.0%, Used Resources:, Num Active Applications: 1, Num Pending Applications: 0, Num Containers: 139, Max Applications: 1, Max Applications Per User: 1, Max Active Applications: 70, Max Active Applications Per User: 70, Configured Capacity: 100.0%, Configured Max Capacity: 100.0%, Configured Minimum User Limit Percent: 100%, Configured User Limit Factor: 1.0, Active users: xxx, Memory: 708608 (100.00%), vCores: 139 (100.00%), Active Apps: 1, Pending Apps: 0. I don't know where the 139 containers value is coming from. 5) Which scheduler are you using? Capacity/Fair/FIFO? I did not set yarn.resourcemanager.scheduler.class, so apparently the default is Capacity. 6) Have you configured any user limits / queue capacity? (Please add details.) No. 7) Are all requests you are making at the same priority or different priorities? (Ideally it will not matter, but I want to know.) I don't set any priority. Thanks for your help. Antoine Vandecreme On Friday, September 20, 2013 12:20:38 PM Omkar Joshi wrote: Hi, a few more questions: "(which has 40 container slots.)" for the total cluster? Please give the below details for the cluster: 1) yarn-site.xml - what is the resource memory configured per node? 2) yarn-site.xml - what is the minimum resource allocation for the cluster? 3) yarn-resource-manager-log (while starting resource manager, export YARN_ROOT_LOGGER=DEBUG,RFA); I am looking for debug logs. 4) On the RM UI, how much total cluster memory is reported (how many total nodes)? (RM UI, click on Cluster.) 5) Which scheduler are you using? Capacity/Fair/FIFO? 6) Have you configured any user limits / queue capacity? (Please add details.) 7) Are all requests you are making at the same priority or with different priorities? (Ideally it will not matter, but I want to know.) Please let us know all the above details. Thanks. Thanks, Omkar Joshi Hortonworks Inc. http://www.hortonworks.com On Fri, Sep 20, 2013 at 6:55 AM, Antoine Vandecreme antoine.vandecr...@nist.gov wrote: Hello Omkar, Thanks for your reply. Yes, all 4 points are correct.
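For reference, a small back-of-the-envelope sketch of where the 116 figure comes from under these settings, assuming the capacity scheduler rounds each request up to a multiple of the minimum allocation:

// Rough container-count estimate using the numbers quoted in this thread.
public class ContainerEstimate {
  public static void main(String[] args) {
    int nodeMemoryMb = 12288;   // yarn.nodemanager.resource.memory-mb
    int mapMemoryMb = 5000;     // mapreduce.map.memory.mb
    int minAllocMb = 1024;      // scheduler minimum allocation
    // Requests are rounded up to a multiple of the minimum allocation: 5000 -> 5120.
    int normalized = ((mapMemoryMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    int perNode = nodeMemoryMb / normalized;          // 12288 / 5120 = 2
    System.out.println(perNode + " map containers per node, "
        + (perNode * 58) + " across 58 nodes");       // 116, matching the thread
  }
}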
Re: Distributed cache in command line
Hi, I have no idea about RHadoop, but in general in YARN we do create symlinks for the files in the distributed cache in the current working directory of every container. You may be able to use that somehow. Thanks, Omkar Joshi Hortonworks Inc. http://www.hortonworks.com On Mon, Sep 23, 2013 at 6:28 AM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, Is it possible to access the distributed cache from the command line? I have written a custom InputFormat implementation which I want to add to the distributed cache. Using libjars is not an option for me, as I am not running the Hadoop job from the command line; I am running it using the RHadoop package in R, which internally uses Hadoop streaming. Please help. Thanks. Regards, Anand.C
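To illustrate the symlink behavior Omkar describes, here is a hypothetical sketch: a file added to the distributed cache shows up under its symlink name in each container's working directory. Paths and names are placeholders, and this uses the plain Java MapReduce API rather than RHadoop/streaming:

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-demo");
    // "#lookup.txt" sets the symlink name created in the task working directory.
    job.addCacheFile(new URI("hdfs:///user/demo/lookup.txt#lookup.txt"));
    // ... set mapper/reducer, input/output paths, then submit ...
  }

  // Inside a task, the cached file can then be opened as a local file:
  static File cachedCopy() {
    return new File("lookup.txt");
  }
}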
Re: HDFS federation Configuration
I'm not able to follow the page completely. Can you pls help me to get some clear step by step or little bit more details in the configuration side? Have you set up a non-federated cluster before? If you have, the page should be easy to follow. If you have not set up a non-federated cluster before, I suggest doing so before looking at this document. I think the document already has step-by-step instructions.
mapred configuration vs mapreduce configuration in Oozie
I am implementing an Oozie workflow MR job. I was not sure whether I have to use the mapred properties or the mapreduce properties. I have seen that if the mapreduce properties are to be used, it has to be set explicitly, like JobConf.setUseNewMapper(boolean flag). How do I do the same in Oozie? Also another question: if I were to use the new properties, where do I find the property names, like mapreduce.job.inputformat.class? These seem to be hidden in some private API. Thanks, your answers are appreciated.
Re: mapred configuration vs mapreduce configuration in Oozie
Hi, You can use the new API in your action if you set these two properties in the action's configuration section:
<property>
  <name>mapred.mapper.new-api</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reducer.new-api</name>
  <value>true</value>
</property>
The JobConf.setUseNewMapper(boolean flag) simply sets the same property; same with JobConf.setUseNewReducer(boolean flag). - Robert
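For comparison, the same switch expressed through the Java API (just a sketch; in an Oozie action you would normally set the two XML properties above instead):

import org.apache.hadoop.mapred.JobConf;

public class NewApiFlags {
  public static JobConf withNewApi(JobConf conf) {
    // Equivalent to setting mapred.mapper.new-api / mapred.reducer.new-api to true.
    conf.setUseNewMapper(true);
    conf.setUseNewReducer(true);
    return conf;
  }
}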
Re: mapred configuration vs mapreduce configuration in Oozie
Hi Kay, Do you want to use the new Hadoop API? If yes, did you check this link? https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook (CASE-2: RUNNING MAPREDUCE USING THE NEW HADOOP API). For your second question, I don't know a good link for that, but this could be helpful: http://stackoverflow.com/questions/10986633/hadoop-configuration-mapred-vs-mapreduce Regards, Mohammad From: KayVajj vajjalak...@gmail.com To: cdh-u...@cloudera.org; common-u...@hadoop.apache.org; user@hadoop.apache.org Sent: Monday, September 23, 2013 2:39 PM Subject: mapred configuration vs mapreduce configuration in Oozie
Re: mapred configuration vs mapreduce configuration in Oozie
+user@oozie, -cdh-user (can't send) From: Mohammad Islam misla...@yahoo.com To: user@hadoop.apache.org; cdh-u...@cloudera.org Sent: Monday, September 23, 2013 3:24 PM Subject: Re: mapred configuration vs mapreduce configuration in Oozie
Re: Task status query
Yep, typically the AM should pass its host:port to the task as part of either the cmd-line for the task or in its env. That is what is done by the MR AM. hth, Arun On Sep 21, 2013, at 6:52 AM, John Lilley john.lil...@redpoint.net wrote: Thanks Harsh! The data-transport format is pretty easy, but how is the RPC typically set up? Does the AM open a listen port to accept the RPC from the tasks, and then pass the port/URI to the tasks when they are spawned, as command-line or environment? john -----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Friday, September 20, 2013 7:47 AM To: user@hadoop.apache.org Subject: Re: Task status query Right now it's MR specific (TaskUmbilicalProtocol) - YARN doesn't have any reusable items here yet, but there are easy-to-use RPC libs such as Avro and Thrift out there that make it easy to do such things once you define what you want in a schema/spec form. On Fri, Sep 20, 2013 at 5:32 PM, John Lilley john.lil...@redpoint.net wrote: Thanks Harsh. Is this protocol something that is available to all AMs/tasks? Or is it up to each AM/task pair to develop their own protocol? john -----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Thursday, September 19, 2013 9:20 PM To: user@hadoop.apache.org Subject: Re: Task status query Hi John, YARN tasks can be more than simple executables. In the case of MR, for example, tasks talk to the AM and report their individual progress and counters back to it, via a specific protocol (over the network), giving the AM more data to compute a near-accurate global progress. On Fri, Sep 20, 2013 at 12:18 AM, John Lilley john.lil...@redpoint.net wrote: How does a YARN application master typically query ongoing status (like percentage completion) of its tasks? I would like to be able to ultimately relay information to the user like: 100 tasks are scheduled, 10 tasks are complete, 4 tasks are running and they are (4%, 10%, 50%, 70%) complete. But, given that YARN tasks are simply executables, how can the AM even get at this information? Can the AM get access to stdout/stderr? Thanks, John -- Harsh J -- Harsh J -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
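A hypothetical sketch of the pattern Arun describes: the AM listens on a socket and hands its address to each task through the container's environment. The environment variable name and the status protocol itself are made up for illustration; a real AM would accept and handle connections on the socket:

import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class StatusEndpoint {
  public static ContainerLaunchContext launchContextWithCallback() throws Exception {
    ServerSocket statusServer = new ServerSocket(0);   // AM-side listener on any free port
    InetSocketAddress addr = new InetSocketAddress(
        java.net.InetAddress.getLocalHost().getHostName(), statusServer.getLocalPort());

    Map<String, String> env = new HashMap<>();
    env.put("AM_STATUS_ADDRESS", addr.getHostName() + ":" + addr.getPort());

    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);   // the task reads System.getenv("AM_STATUS_ADDRESS")
    return ctx;
  }
}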
Re: Help writing a YARN application
I looked at your code; it's using really old APIs/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest APIs/protocols. Also, you should really be using the yarn client module rather than the raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that since my code isn't complete yet). But I have no logs that I can examine. So I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
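A minimal sketch of the "yarn client module" route Arun mentions, loosely following the simple-yarn-app example; the application name, AM command, and resource sizes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApp {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();

    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("sample-app");

    // The AM container: in a real app this also needs local resources (the AM jar).
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(java.util.Collections.singletonList(
        "$JAVA_HOME/bin/java MyAppMaster 1>/dev/null 2>&1"));  // placeholder command
    appContext.setAMContainerSpec(amContainer);

    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(512);
    amResource.setVirtualCores(1);
    appContext.setResource(amResource);

    client.submitApplication(appContext);
  }
}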
Re: Queues in Fair Scheduler
Please don't cross-post. On Sep 22, 2013, at 11:19 PM, Anurag Tangri tangri.anu...@gmail.com wrote: Hi, We are trying to implement queues in the fair scheduler using mapred ACLs. If I set up queues and try to use them in the fair scheduler, then unless I add the following two properties to the job, the job fails: -Dmapred.job.queue.name=<queue name> -D mapreduce.job.acl-view-job=* Is that correct? If yes, is there a way to set a default queue per user, and similarly a better way to set the acl-view property? Thanks, Anurag Tangri -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
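One way to avoid passing those flags on every job submission is to give them defaults in the client-side mapred-site.xml. This is only a sketch: the queue name is a placeholder, and opening acl-view-job to * may be too permissive for your site.

    <property>
      <name>mapred.job.queue.name</name>
      <value>default_queue</value>   <!-- placeholder: queue jobs fall into when none is given -->
    </property>
    <property>
      <name>mapreduce.job.acl-view-job</name>
      <value>*</value>               <!-- who may view job details; '*' means everyone -->
    </property>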
hadoop infinispan view not detected inside Map/Reduce
Hi all, I want to use Infinispan and Hadoop. I made a simple test (the wordcount example). I have this type of mode in my Infinispan XML: <clustering mode="distribution"> <hash numOwners="2"/> </clustering> and also this: <clustering mode="replication"/> First: I create a folder and upload a file X to my Infinispan cluster (on each machine I can access the grid and the file is found). Second: in my Hadoop job, in the main class I can access the grid and see the X file that I uploaded. Third: I want to get the X file in my Map and/or Reduce class of the job, but the X file is not found. As I understand it, the Hadoop TaskTracker runs the maps and reduces as child tasks in separate JVMs. Is that why I can't access the Infinispan grid from the Map or Reduce class? If so, any ideas how I can get the file in the map/reduce? Do I need a special Infinispan configuration for this? Thanks -- Cornelio
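The separate-JVM observation is the key point: a cache manager created in the job's driver/main JVM is not visible to the map and reduce task JVMs, so each task has to join (or connect to) the grid itself, typically in setup(). Below is only a sketch of that idea, assuming an embedded DefaultCacheManager and that the Infinispan XML file has been shipped to the task working directory (e.g. via the distributed cache); the file name, cache name, and key are placeholders, and depending on your topology a client/server (Hot Rod) connection may be preferable to joining the cluster from short-lived task JVMs.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.infinispan.Cache;
    import org.infinispan.manager.DefaultCacheManager;

    public class GridAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
      private DefaultCacheManager cacheManager;
      private Cache<String, byte[]> grid;

      @Override
      protected void setup(Context context) throws IOException {
        // Each task JVM starts its own cache manager and joins the grid here,
        // instead of relying on the manager created in the job driver.
        cacheManager = new DefaultCacheManager("infinispan.xml");  // placeholder path
        grid = cacheManager.getCache("files");                     // placeholder cache name
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        byte[] fileX = grid.get("X");  // now resolvable from inside the task JVM
        // ... use fileX ...
      }

      @Override
      protected void cleanup(Context context) {
        if (cacheManager != null) {
          cacheManager.stop();
        }
      }
    }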
RE: How to best decide mapper output/reducer input for a huge string?
No, I'm pretty sure the job is executing fine. It's just that the time it takes to complete the whole process is too much, that's all. I didn't mean to say the mapper or the reducer doesn't work, just that it's very slow and I'm trying to figure out where that's happening in my code. Regards, Pavan On Sep 23, 2013 11:49 PM, John Lilley john.lil...@redpoint.net wrote: You might try creating a "stub" MR job in which the mappers produce no output; that would isolate the time spent reading from HBase without the trouble of instrumenting your code. John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Monday, September 23, 2013 3:31 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? @John, to be really frank I don't know what the limiting factor is. It might be all of them or a subset of them; I cannot tell. On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra pavan0...@gmail.com wrote: @Rahul, yes you are right. 21 mappers are spawned and all 21 mappers are functional at the same time. Although, @Pradeep, I should do the compression like you say; I'll give it a shot. As far as I can see, I think I'll need to implement Writable and write out the mapper key using specific data types instead of writing it out as a string, which might be slowing the operation down. On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Pavan, it's hard to tell whether there's anything wrong with your design or not, since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only one mapper might be doing the heavy lifting. A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice and you should almost always have one. Compression can be turned on by setting the following properties in your job config: <property><name>mapreduce.map.output.compress</name><value>true</value></property> <property><name>mapreduce.map.output.compress.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property> You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc., depending on your use cases. Gzip is really slow but gets the best compression ratios. Snappy/Lzo are a lot faster but don't have as good a compression ratio. If your computations are CPU bound, then you'd probably want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks that you can try to get the best performance out of your cluster. One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor usage of resources while your job is running. This will tell you if your job is I/O bound or CPU bound. Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment: http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm and it might warrant a redesign. Hope this helps!
- Pradeep On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: One mapper is spawned per HBase table region. You can use the admin UI of HBase to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done. If that is the case then you can recreate the tables with predefined splits to create more regions. Thanks, Rahul On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote: Pavan, how large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)? John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Saturday, September 21, 2013 2:17 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed. Although, there's no real
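On the "implement Writable instead of concatenating everything into one string" idea mentioned above, a composite map-output key might look like the sketch below; the field names are made up for illustration, and a real key should be tailored to whatever the job actually groups on.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: (customerId, itemId) serialized in binary form
    // instead of being written out as one concatenated string.
    public class CustomerItemKey implements WritableComparable<CustomerItemKey> {
      private long customerId;
      private int itemId;

      public CustomerItemKey() {}  // Hadoop needs the no-arg constructor for deserialization

      public CustomerItemKey(long customerId, int itemId) {
        this.customerId = customerId;
        this.itemId = itemId;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeLong(customerId);
        out.writeInt(itemId);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        customerId = in.readLong();
        itemId = in.readInt();
      }

      @Override
      public int compareTo(CustomerItemKey o) {  // drives the shuffle sort order
        if (customerId != o.customerId) {
          return customerId < o.customerId ? -1 : 1;
        }
        return itemId < o.itemId ? -1 : (itemId == o.itemId ? 0 : 1);
      }

      @Override
      public int hashCode() {  // used by the default HashPartitioner
        return 31 * (int) (customerId ^ (customerId >>> 32)) + itemId;
      }

      @Override
      public boolean equals(Object other) {
        if (!(other instanceof CustomerItemKey)) return false;
        CustomerItemKey k = (CustomerItemKey) other;
        return customerId == k.customerId && itemId == k.itemId;
      }
    }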
Mapreduce jobtracker recover property
Hi, I'm using Hadoop 1.2.1 in a production HDFS cluster, and I can see the following properties with these values in the jobtracker job.xml: mapreduce.job.restart.recover - true, mapred.jobtracker.restart.recover - false. What is the difference, and which property will be taken by the jobtracker? Do we really need these properties? Also, I didn't set a value for those properties in any of the configurations (e.g., mapred-site.xml). Please help on this. -- Regards, Viswa.J
Re: Mapreduce jobtracker recover property
mapred.jobtracker.restart.recover is the old API name, while the other one is the new one. It is used to specify whether jobs should try to resume (be recovered) when the jobtracker restarts. If you don't set it, the default value of false is used (specified in the config file already packaged/bundled in the distribution jars). https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-5/running-on-a-cluster Regards, Shahab On Mon, Sep 23, 2013 at 10:31 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, I'm using Hadoop 1.2.1 in a production HDFS cluster, and I can see the following properties with these values in the jobtracker job.xml: mapreduce.job.restart.recover - true, mapred.jobtracker.restart.recover - false. What is the difference, and which property will be taken by the jobtracker? Do we really need these properties? Also, I didn't set a value for those properties in any of the configurations (e.g., mapred-site.xml). Please help on this. -- Regards, Viswa.J
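For reference, if you'd rather pin the behavior explicitly instead of relying on the bundled defaults, these are the entries you would add to mapred-site.xml; the values shown are just an example of enabling recovery, not a recommendation for your cluster.

    <property>
      <name>mapred.jobtracker.restart.recover</name>
      <value>true</value>   <!-- ask the jobtracker to recover jobs when it restarts -->
    </property>
    <property>
      <name>mapreduce.job.restart.recover</name>
      <value>true</value>   <!-- job-level recovery flag, as discussed above -->
    </property>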
Re: Help writing a YARN application
Hi Arun, Thanks so much for your reply. I looked at the app you linked; it's way easier to understand. Unfortunately, I can't use the YarnClient class. It doesn't seem to be in the 2.0.0 branch. Our production environment is using cdh-4.x and it looks like they're tracking the 2.0.x branch. However, I have looked at other implementations (notably Kitten and Giraph) for inspiration and was able to get my containers launching successfully. Thanks for the help! - Pradeep On Mon, Sep 23, 2013 at 4:46 PM, Arun C Murthy a...@hortonworks.com wrote: I looked at your code, it's using really old APIs/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest APIs/protocols. Also, you should really be using the yarn client module rather than raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that since my code isn't complete yet). But I have no logs that I can examine. So I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Mapreduce jobtracker recover property
Hi Shahab, Thanks for the detailed response. So the new property will have the default value of true in the package? Thanks, Viswa.J On Sep 24, 2013 8:33 AM, Shahab Yunus shahab.yu...@gmail.com wrote: mapred.jobtracker.restart.recover is the old API name, while the other one is the new one. It is used to specify whether jobs should try to resume (be recovered) when the jobtracker restarts. If you don't set it, the default value of false is used (specified in the config file already packaged/bundled in the distribution jars). https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-5/running-on-a-cluster Regards, Shahab On Mon, Sep 23, 2013 at 10:31 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, I'm using Hadoop 1.2.1 in a production HDFS cluster, and I can see the following properties with these values in the jobtracker job.xml: mapreduce.job.restart.recover - true, mapred.jobtracker.restart.recover - false. What is the difference, and which property will be taken by the jobtracker? Do we really need these properties? Also, I didn't set a value for those properties in any of the configurations (e.g., mapred-site.xml). Please help on this. -- Regards, Viswa.J
Re: Help writing a YARN application
Hi Pradeep, You can also look at Weave - a simpler thread abstraction over YARN: http://github.com/continuuity/weave Runs on multiple HDs. Nitin On Mon, Sep 23, 2013 at 10:20 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi Arun, Thanks so much for your reply. I looked at the app you linked; it's way easier to understand. Unfortunately, I can't use the YarnClient class. It doesn't seem to be in the 2.0.0 branch. Our production environment is using cdh-4.x and it looks like they're tracking the 2.0.x branch. However, I have looked at other implementations (notably Kitten and Giraph) for inspiration and was able to get my containers launching successfully. Thanks for the help! - Pradeep On Mon, Sep 23, 2013 at 4:46 PM, Arun C Murthy a...@hortonworks.com wrote: I looked at your code, it's using really old APIs/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest APIs/protocols. Also, you should really be using the yarn client module rather than raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that since my code isn't complete yet). But I have no logs that I can examine. So I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/