how to get a core-site.xml info from a java application?

2011-01-25 Thread Jun Young Kim

Hi,

I am a beginner with Hadoop.
I want to know how to read, from my application, the configuration values
that are defined in the *.xml files.


For example, $HADOOP_HOME/conf/core-site.xml contains:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>


How can I use the "fs.default.name" value in my application?
This is my source code:


Configuration conf = new Configuration();
System.out.println(conf.get("fs.default.name"));
// prints nothing from core-site.xml


How can I do this?

--

Junyoung Kim (juneng...@gmail.com)



Re: how to get a core-site.xml info from a java application?

2011-01-25 Thread Harsh J
The Configuration class statically loads property names and values from
resources (unless you specifically ask it not to). It first loads all the
defaults (found in Hadoop's common/core jar as core-default.xml) and then
loads core-site.xml from the CLASSPATH, if found.

If you can put the directory containing core-site.xml (or the file itself)
onto the CLASSPATH of your application, Configuration will locate that
resource and load it, which should solve your issue.
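
If it is easier, you can also point Configuration at the file explicitly. A
minimal sketch (the class name and the conf path below are only examples of
where your installation might live):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowFsDefaultName {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Explicitly add the site file; the path here is just an illustration.
    conf.addResource(new Path("/opt/hadoop/conf/core-site.xml"));
    System.out.println(conf.get("fs.default.name"));
  }
}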

On Tue, Jan 25, 2011 at 3:36 PM, Jun Young Kim  wrote:
> Hi,
>
> I am a beginner of a hadoop.
> now I want to know a way to get my configuration information which are
> defined in *.xml on my applications.
>
> for example)
> $HADOOP_HOME/conf/core-site.xml
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost:54310</value>
>   </property>
> </configuration>
>
> How I can use the "fs.default.name" information  in my application.
> this is my source code.
>
> 
> Configuration conf = new Configuration();
> System.out.println(conf.getString("fs.default.name"));
> // print nothing.
> 
>
> How I can ??
>
> --
>
> Junyoung Kim (juneng...@gmail.com)
>
>



-- 
Harsh J
www.harshj.com


Map->Reduce->Reduce

2011-01-25 Thread Matthew John
Hi all,


I am working on a MapReduce program that processes BytesWritable data.
Currently I run two MapReduce jobs consecutively to get the final output:

Input --(MapReduce1)--> Intermediate --(MapReduce2)--> Output

Here I run MapReduce2 only to sort the intermediate data according to a
key-comparator logic.

I want to cut this down to a single MapReduce job, and I have worked out the
logic to do so. The only problem is that my logic requires a sort of the
reduce output to produce the final output. The flow looks like this:

Input --(MapReduce1)--> Output (not sorted)

I want to know whether it is possible to attach one more reduce phase to the
dataflow so that the framework's inherent sort runs before the second reduce
call. It would look like:

Input --(Map)--> MapOutput --(Reduce1)--> Output (not sorted) --(Reduce2, for
which Reduce1 acts as a mapper)--> Output

Please let me know if there is some way of sorting the output without
invoking a separate MapReduce job just for the sake of sorting it.

Thanks ,
Matthew


Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-25 Thread Renaud Delbru

Hi,

we would like to limit the maximum number of concurrent tasks per job on our
Hadoop 0.20.2 cluster.
Will the Capacity Scheduler [1] allow us to do this? Does it work correctly
on Hadoop 0.20.2? (I remember that a few months ago, when we looked at it, it
seemed incompatible with Hadoop 0.20.2.)


[1] http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html

Regards,
--
Renaud Delbru


Re: SSH problem in hadoop installation

2011-01-25 Thread real great..
Hi,
@Saurabh: it is a simple cluster in our lab, so there is no separate server
or administrator; we have all the passwords ourselves.

On Tue, Jan 25, 2011 at 11:49 AM, Saurabh Dutta  wrote:

> Hi,
>
> Do you have access to the server? If you don't, you'll have to ask the
> administrator to check that the client's IP address is present in the server's
> /etc/hosts.allow file and make sure it is not present in the /etc/hosts.deny file.
>
> There could be other reasons too, such as corrupted host key fingerprints, but
> this should be the first step; only then should you look at other options.
>
> Thanks,
> Saurabh Dutta
>
> -Original Message-
> From: real great.. [mailto:greatness.hardn...@gmail.com]
> Sent: Tuesday, January 25, 2011 11:38 AM
> To: common-user
> Subject: SSH problem in hadoop installation
>
> Hi,
> I am trying to install Hadoop on a Linux cluster (Fedora 12).
> However, I am not able to SSH to localhost; it gives the following error:
>
> *ssh_exchange_identification: Connection closed by remote host*
>
> I know this is not the correct forum for this question, yet it could save me
> a lot of time if any of you could help.
> Thanks,
>
>
>
> --
> Regards,
> R.V.
>



-- 
Regards,
R.V.


Re: SSH problem in hadoop installation

2011-01-25 Thread rahul patodi
Hi,
Have you installed ssh on all the nodes? If yes, configure it. You can refer to
http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-distributed-mode.html
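
If key-based login is what is missing, the usual passwordless-ssh setup such
tutorials describe looks roughly like this (single-node case shown; the exact
steps in your tutorial may differ):

# generate a key with an empty passphrase and authorize it for localhost
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost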

-- 
*Regards*,
Rahul Patodi
Software Engineer,
Impetus Infotech (India) Pvt Ltd,
www.impetus.com
Mob:09907074413

On Tue, Jan 25, 2011 at 5:20 PM, real great..
wrote:

> Hi,
> @sourabh: its a simple cluster in lab. So nothing like server and
> administrator..
> We have always passwords with us only.
>
> On Tue, Jan 25, 2011 at 11:49 AM, Saurabh Dutta <
> saurabh.du...@impetus.co.in
> > wrote:
>
> > Hi,
> >
> > Do you've access the to the server. If you don't you'll have to ask the
> > administrator to check if the client's IP address is present in the
> server's
> > /etc/hosts.allow and make sure it is not present in the /etc/hosts.deny
> file
> >
> > There could be other reasons too like the Fingerprint Keys getting
> > corrupted but this should be the first step and only then you should look
> > for other options.
> >
> > Thanks,
> > Saurabh Dutta
> >
> > -Original Message-
> > From: real great.. [mailto:greatness.hardn...@gmail.com]
> > Sent: Tuesday, January 25, 2011 11:38 AM
> > To: common-user
> > Subject: SSH problem in hadoop installation
> >
> > Hi,
> > Am trying to install Hadoop on a linux cluster(Fedora 12).
> > However, am not able to SSH to localhost and gives the following error.
> >
> > *ssh_exchange_identification: Connection closed by remote host*
> >
> > I know this is not the correct forum for asking this question. Yet it
> could
> > solve a lot of my time if any of you could help me.
> > Thanks,
> >
> >
> >
> > --
> > Regards,
> > R.V.
> >
>
>
>
> --
> Regards,
> R.V.
>


Re: installation of Hadoop 0.21

2011-01-25 Thread Jim X
Thanks for the information. I was misled by the tutorial at
http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/. I can
access the web UIs at
 NameNode - http://localhost:50070/
 JobTracker - http://localhost:50030/

rather than at http://localhost:9100 and http://localhost:9101 as
mentioned in the tutorial.


Jim

On Tue, Jan 25, 2011 at 12:04 AM, li ping  wrote:
> The exception "java.io.IOException: NameNode is not formatted." indicates you
> should format the NameNode first:
> bin/hadoop namenode -format
>
> On Tue, Jan 25, 2011 at 12:47 PM, Jim X  wrote:
>
>> I am trying to install Hadoop by following the instruction from
>> http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/.
>>
>> 1. I can not open http://localhost:9100 or http://localhost:9101 after
>> I run "bin/start-dfs.sh" and "bin/start-mapred.sh" without any error
>> message being printed.
>>
>> 2. I shutdown cygwin shell.
>>
>> 3. I start another cygwin shell, run "bin/start-dfs.sh" and get the
>> following message from the shell.
>>       $ bin/start-dfs.sh
>>       starting namenode, logging to
>> C:\cygwin\hadoop\0.21.0\logs/hadoop-Jim-namenode-Jim-PC.out
>>       localhost: datanode running as process 6908. Stop it first.
>>       localhost: secondarynamenode running as process 6156. Stop it first.
>>
>>
>>
>>       Log message in
>> C:\cygwin\hadoop\0.21.0\logs/hadoop-Jim-namenode-Jim-PC.out are listed
>> as below:
>>
>> 2011-01-24 23:10:11,202 INFO
>> org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode up at:
>> 127.0.0.1/127.0.0.1:9100
>> 2011-01-24 23:10:36,187 INFO org.apache.hadoop.ipc.Server: IPC Server
>> listener on 9100: readAndProcess threw exception java.io.IOException:
>> Unable to read authentication method. Count of bytes read: 0
>> java.io.IOException: Unable to read authentication method
>>        at
>> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1079)
>>        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:525)
>>        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:332)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:619)
>>
>> 
>>
>>
>> 2011-01-24 23:41:47,815 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>> FSNamesystemStatusMBean
>> 2011-01-24 23:41:47,915 ERROR
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
>> initialization failed.
>> java.io.IOException: NameNode is not formatted.
>>        at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:270)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:433)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:421)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
>>        at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
>>
>>
>> I am running Windows Vista with JDK 1.6. I appreciate your help.
>>
>>
>> Jim
>>
>
>
>
> --
> -李平
>


RE: SSH problem in hadoop installation

2011-01-25 Thread Saurabh Dutta
By client I mean the machine from which you are trying to ssh; by server I mean
the machine you want to log in to. Can you paste the contents of your
/etc/hosts, /etc/hosts.allow, and /etc/hosts.deny files?
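
For reference, the kind of entries to look for (the subnet below is only an
example):

# /etc/hosts.allow -- allow ssh from the lab subnet
sshd: 192.168.1.0/255.255.255.0

# /etc/hosts.deny -- a catch-all like this blocks anything not explicitly allowed
ALL: ALL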

-Original Message-
From: rahul patodi [mailto:patodira...@gmail.com]
Sent: Tuesday, January 25, 2011 5:31 PM
To: common-user@hadoop.apache.org
Subject: Re: SSH problem in hadoop installation

Hi,
Have you installed ssh on all the nodes?
if yes configure it. You can refer
http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-distributed-mode.html

--
*Regards*,
Rahul Patodi
Software Engineer,
Impetus Infotech (India) Pvt Ltd,
www.impetus.com
Mob:09907074413

On Tue, Jan 25, 2011 at 5:20 PM, real great..
wrote:

> Hi,
> @sourabh: its a simple cluster in lab. So nothing like server and
> administrator..
> We have always passwords with us only.
>
> On Tue, Jan 25, 2011 at 11:49 AM, Saurabh Dutta <
> saurabh.du...@impetus.co.in
> > wrote:
>
> > Hi,
> >
> > Do you've access the to the server. If you don't you'll have to ask the
> > administrator to check if the client's IP address is present in the
> server's
> > /etc/hosts.allow and make sure it is not present in the /etc/hosts.deny
> file
> >
> > There could be other reasons too like the Fingerprint Keys getting
> > corrupted but this should be the first step and only then you should look
> > for other options.
> >
> > Thanks,
> > Saurabh Dutta
> >
> > -Original Message-
> > From: real great.. [mailto:greatness.hardn...@gmail.com]
> > Sent: Tuesday, January 25, 2011 11:38 AM
> > To: common-user
> > Subject: SSH problem in hadoop installation
> >
> > Hi,
> > Am trying to install Hadoop on a linux cluster(Fedora 12).
> > However, am not able to SSH to localhost and gives the following error.
> >
> > *ssh_exchange_identification: Connection closed by remote host*
> >
> > I know this is not the correct forum for asking this question. Yet it
> could
> > solve a lot of my time if any of you could help me.
> > Thanks,
> >
> >
> >
> > --
> > Regards,
> > R.V.
> >
>
>
>
> --
> Regards,
> R.V.
>





Re: Map->Reduce->Reduce

2011-01-25 Thread Harsh J
Vanilla Hadoop does not support this without the intermediate I/O
cost. You can check out the Hadoop Online Project at
http://code.google.com/p/hop, as it does support letting a reducer's
output go directly to the next job's mapper (i.e., as a pipeline).

On the topic of pipelining, also check out what's being done in Plume
(based on Google's FlumeJava): http://github.com/tdunning/Plume
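
For completeness, the plain-vanilla alternative is simply to chain the two
jobs, letting the first job's output directory feed the second, which is
exactly the intermediate I/O cost mentioned above. A rough sketch (the
mapper/reducer class names and paths are placeholders, not your actual code;
imports from org.apache.hadoop.mapreduce and its lib.input/lib.output
packages are assumed, and output key/value class setup is omitted):

Configuration conf = new Configuration();

Job first = new Job(conf, "process");
first.setMapperClass(MyMapper.class);        // placeholder classes
first.setReducerClass(MyReducer.class);
FileInputFormat.addInputPath(first, new Path("/input"));
FileOutputFormat.setOutputPath(first, new Path("/intermediate"));
first.waitForCompletion(true);

// The second job's (identity) map feeds the shuffle, which performs the sort
// before the second reduce runs.
Job second = new Job(conf, "sort");
second.setReducerClass(MySortingReducer.class);
FileInputFormat.addInputPath(second, new Path("/intermediate"));
FileOutputFormat.setOutputPath(second, new Path("/output"));
second.waitForCompletion(true);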

On Tue, Jan 25, 2011 at 5:16 PM, Matthew John
 wrote:
> Hi all,
>
>
> I was working on a MapReduce program which does BytesWritable
> dataprocessing. But currently I am basically running two MapReduces
> consecutively to get the final output :
>
> Input  (MapReduce1)---> Intermediate (MapReduce2)---> Output
>
> Here I am running MapReduce2 only to sort the intermediate data on the basis
> of a Key comparator logic.
>
> I wanted to cut short the number of MapReduces to just one. I have figured
> out a logic to do the same. But the only problem is that in my  logic I need
> to run a sort on the Reduce output to get the  final output. the flow looks
> like this :
>
> Input (MapReduce1)> Output (not sorted)
>
> I want to know if its possible to attach one more Reduce module to the
> dataflow so that it can perform the inherent sort before the 2nd reduce
> call. It would look like :
>
> Input --(Map)---> MapOutput ---(Reduce1)-->Output (not sorted) ---(Reduce2 -
> for which Reduce 1 acts as a Mapper)---> Output
>
> Please let me know  if  there can be some means of sorting the output
> without invoking a separate MapReduce just for the sake of sorting it .
>
> Thanks ,
> Matthew
>



-- 
Harsh J
www.harshj.com


Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-25 Thread Harsh J
The Capacity Scheduler (or a version of it) does ship with the 0.20
release of Hadoop and is usable. It lets you define queues, each with a
limited capacity, and your jobs must then be submitted to the appropriate
queue if you want them to use only the assigned fraction of the cluster.
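
As a rough illustration of what that looks like (the queue name "limited" and
the 20% figure are made-up examples; see the capacity scheduler docs for the
full property list):

<!-- mapred-site.xml: declare the queue (example name) -->
<property>
  <name>mapred.queue.names</name>
  <value>default,limited</value>
</property>

<!-- capacity-scheduler.xml: cap the example queue at 20% of the cluster's slots -->
<property>
  <name>mapred.capacity-scheduler.queue.limited.capacity</name>
  <value>20</value>
</property>

Jobs would then be submitted with something like -Dmapred.job.queue.name=limited.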

On Tue, Jan 25, 2011 at 5:19 PM, Renaud Delbru  wrote:
> Hi,
>
> we would like to limit the number of maximum tasks per job on our hadoop
> 0.20.2 cluster.
> Is the Capacity Scheduler [1] will allow to do this ? Is it correctly
> working on hadoop 0.20.2 (I remember a  few months ago, we were looking at
> it, but it seemed incompatible with hadoop 0.20.2).
>
> [1] http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html
>
> Regards,
> --
> Renaud Delbru
>



-- 
Harsh J
www.harshj.com


Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-25 Thread Renaud Delbru
Our experience with the Capacity Scheduler was not what we expected, nor what
you describe, but that might be due to a misunderstanding of the configuration
parameters on our side.

The problem is the following:
mapred.capacity-scheduler.queue.<queue-name>.capacity: percentage of the
number of slots in the cluster that are *guaranteed* to be available for
jobs in this queue.
mapred.capacity-scheduler.queue.<queue-name>.minimum-user-limit-percent:
each queue enforces a limit on the percentage of resources allocated to
a user at any given time, if *there is competition for them*.

So it seems that if there is no competition and the cluster is fully
available, the scheduler will assign the full cluster to the job and will
not limit the number of concurrent tasks. It seemed to us that the only way
to enforce a hard limit was to use the Fair Scheduler of Hadoop 0.21.0,
which includes a new configuration parameter, 'maxMaps'.
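
For reference, the kind of Fair Scheduler allocations entry we have in mind
(pool name and limit are just examples):

<allocations>
  <pool name="limited">
    <maxMaps>20</maxMaps>
  </pool>
</allocations>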


Am I right, or did we miss something?

cheers
--
Renaud Delbru

On 25/01/11 15:20, Harsh J wrote:

Capacity Scheduler (or a version of it) does ship with the 0.20
release of Hadoop and is usable. It can be used to assign queues with
a limited capacity for each, which your jobs must appropriately submit
to if you want them to utilize only the assigned fraction of your
cluster for its processing.

On Tue, Jan 25, 2011 at 5:19 PM, Renaud Delbru  wrote:

Hi,

we would like to limit the number of maximum tasks per job on our hadoop
0.20.2 cluster.
Is the Capacity Scheduler [1] will allow to do this ? Is it correctly
working on hadoop 0.20.2 (I remember a  few months ago, we were looking at
it, but it seemed incompatible with hadoop 0.20.2).

[1] http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html

Regards,
--
Renaud Delbru








Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-25 Thread Harsh J
No, that is right. I did not assume that you were looking to impose a strict
slot limit on your jobs.

On Tue, Jan 25, 2011 at 9:27 PM, Renaud Delbru  wrote:
> Our experience with the Capacity Scheduler was not what we expected and what
> you describe. But, it might be due to a wrong comprehension of the
> configuration parameters.
> The problem is the following:
> mapred.capacity-scheduler.queue.<queue-name>.capacity: Percentage of the
> number of slots in the cluster that are *guaranteed* to be available for
> jobs in this queue.
> mapred.capacity-scheduler.queue.<queue-name>.minimum-user-limit-percent:
> Each queue enforces a limit on the percentage of resources allocated to a
> user at any given time, if *there is competition for them*.
>
> So, in fact, it seems that if there is no competition, and that the cluster
> is fully available, the scheduler will assign the full cluster to the job,
> and will not limit the number of concurrent tasks. It seemed to us that the
> only way to enforce a strong limit was to use the Fair Scheduler of hadoop
> 0.21.0 which includes a new configuration parameters 'maxMaps'.
>
> Am I right, or did we miss something ?
>
> cheers
> --
> Renaud Delbru
>
> On 25/01/11 15:20, Harsh J wrote:
>>
>> Capacity Scheduler (or a version of it) does ship with the 0.20
>> release of Hadoop and is usable. It can be used to assign queues with
>> a limited capacity for each, which your jobs must appropriately submit
>> to if you want them to utilize only the assigned fraction of your
>> cluster for its processing.
>>
>> On Tue, Jan 25, 2011 at 5:19 PM, Renaud Delbru
>>  wrote:
>>>
>>> Hi,
>>>
>>> we would like to limit the number of maximum tasks per job on our hadoop
>>> 0.20.2 cluster.
>>> Is the Capacity Scheduler [1] will allow to do this ? Is it correctly
>>> working on hadoop 0.20.2 (I remember a  few months ago, we were looking
>>> at
>>> it, but it seemed incompatible with hadoop 0.20.2).
>>>
>>> [1] http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html
>>>
>>> Regards,
>>> --
>>> Renaud Delbru
>>>
>>
>>
>
>



-- 
Harsh J
www.harshj.com


Hadoop Binary File

2011-01-25 Thread F.Ozgur Catak
Hi,

I'm trying to develop an image processing application with Hadoop. All the
image files are in HDFS, but I don't know how to read these files as a
binary/byte stream. What is the correct declaration of the Mapper and
Reducer classes?

Thanks

Ozgur CATAK


Re: Hadoop Binary File

2011-01-25 Thread Keith Wiley
I'm also doing binary image processing on Hadoop.  Where relevant, my Key and 
Value types are a WritableComparable class of my own creation which contains as 
members a BytesWritable object, obviously read from the file itself directly 
into memory.  I also keep the path in my class so I know where the file came 
from later.
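
A rough sketch of what such a class can look like (the names here are
illustrative, not my actual code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class ImageWritable implements WritableComparable<ImageWritable> {
  private Text path = new Text();                     // where the image came from
  private BytesWritable bytes = new BytesWritable();  // the raw image contents

  public void set(String p, byte[] data) {
    path.set(p);
    bytes.set(data, 0, data.length);
  }

  public void write(DataOutput out) throws IOException {
    path.write(out);
    bytes.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    path.readFields(in);
    bytes.readFields(in);
  }

  public int compareTo(ImageWritable other) {
    return path.compareTo(other.path);                // order by source path
  }
}

The bytes member is typically filled by opening the file with FileSystem.open()
and reading it fully, for example in a custom RecordReader or during map setup.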

On Jan 25, 2011, at 11:46 , F.Ozgur Catak wrote:

> Hi,
> 
> I'm trying to develop an image processing application with hadoop. All image
> files are in HDFS.  But I don't know how to read this files with binary/byte
> stream. What is correct decleration of Mapper and Reducer
> Class.
> 
> Thanks
> 
> Ozgur CATAK



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
  -- Mark Twain






the performance of HDFS

2011-01-25 Thread Da Zheng

Hello,

I am trying to measure the performance of HDFS, but the write rate is quite
low. With a replication factor of 1, the rate of writing to HDFS is about
60 MB/s. With a replication factor of 3, the rate drops significantly, to
about 15 MB/s; even though the actual rate of writing data to the disk is
then about 45 MB/s, the HDFS rate is still much lower than with replication
factor 1. The link between two nodes in the cluster is 1 Gbps, and the CPU
is a dual-core AMD Opteron 2212, so the CPU isn't the bottleneck either. I
thought I would be able to saturate the disk very easily, so I wonder where
the bottleneck is. What write throughput do people see on a Hadoop cluster
when the replication factor is 3?
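
The measurement is essentially a timed sequential write through the HDFS
client API. A minimal sketch of that kind of test (arbitrary path and size,
not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[1 << 20];                 // 1 MB buffer of zeros
    long start = System.currentTimeMillis();
    FSDataOutputStream out = fs.create(new Path("/tmp/writetest"));
    for (int i = 0; i < 1024; i++) {                // write ~1 GB
      out.write(buf);
    }
    out.close();
    long millis = System.currentTimeMillis() - start;
    System.out.println("MB/s: " + (1024.0 * 1000.0 / millis));
  }
}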


Thanks,
Da


Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-25 Thread Renaud Delbru
As it seems that the capacity and fair schedulers in hadoop 0.20.2 do 
not allow a hard upper limit in number of concurrent tasks, do anybody 
know any other solutions to achieve this ?

--
Renaud Delbru

On 25/01/11 11:49, Renaud Delbru wrote:

Hi,

we would like to limit the number of maximum tasks per job on our 
hadoop 0.20.2 cluster.
Is the Capacity Scheduler [1] will allow to do this ? Is it correctly 
working on hadoop 0.20.2 (I remember a  few months ago, we were 
looking at it, but it seemed incompatible with hadoop 0.20.2).


[1] http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html

Regards,




Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
This is a bit lower than it should be, but it is not far out of line with
what is reasonable.

Did you make sure that you have multiple separate disks for HDFS to use? With
many disks, you should be able to get local disk write speeds up to a few
hundred MB/s.

Once you involve replication, the data has to go out the network interface,
back in to another machine, back out, and back in to a third machine. There
are lots of copies going on, and if you are writing lots of files, you will
typically be limited to at most 1/2 of your network bandwidth, and doing a
bit less than that is to be expected. What you are seeing is lower than it
should be, but only by a moderate factor.

On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng  wrote:

> Hello,
>
> I try to measure the performance of HDFS, but the writing rate is quite
> low. When the replication factor is 1, the rate of writing to HDFS is about
> 60MB/s. When the replication factor is 3, the rate drops significantly to
> about 15MB/s. Even though the actual rate of writing data to the disk is
> about 45MB/s, it's still much lower than when replication factor is 1. The
> link between two nodes in the cluster is 1Gbps. CPU is Dual-Core AMD
> Opteron(tm) Processor 2212, so CPU isn't bottleneck either. I thought I
> should be able to saturate the disk very easily. I wonder where the
> bottleneck is. What is the throughput for writing on a Hadoop cluster when
> the replication factor is 3?
>
> Thanks,
> Da
>


Re: the performance of HDFS

2011-01-25 Thread M. C. Srivas
On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng  wrote:

> Hello,
>
> I try to measure the performance of HDFS, but the writing rate is quite
> low. When the replication factor is 1, the rate of writing to HDFS is about
> 60MB/s. When the replication factor is 3, the rate drops significantly to
> about 15MB/s. Even though the actual rate of writing data to the disk is
> about 45MB/s, it's still much lower than when replication factor is 1. The
> link between two nodes in the cluster is 1Gbps. CPU is Dual-Core AMD
> Opteron(tm) Processor 2212, so CPU isn't bottleneck either. I thought I
> should be able to saturate the disk very easily. I wonder where the
> bottleneck is. What is the throughput for writing on a Hadoop cluster when
> the replication factor is 3?
>

The numbers above seem correct as per my observations.  If your data is
3-way replicated, the data-node writes about 3x the actual data written.
Conversely, your write-rate will be limited to 1/3 of  how fast the disk can
write, minus some overhead for replication.

The aggregate write-rate can get much higher if you use more drives, but a
single stream throughput is limited to the speed of one disk spindle.
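
(With the numbers above that is consistent: roughly 45 MB/s of local disk
writing divided across 3 copies comes to about 15 MB/s of client-visible
throughput, which matches what you observed.)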






> Thanks,
> Da
>


Re: How to manage large record in MapReduce

2011-01-25 Thread lei
Hi Jerome,

I have a similar problem as yours. Would you please share more details about
your solution?

Thanks,
Lei




Distcp starting only 1 mapper at a time.

2011-01-25 Thread Ravi Phulari
I am trying to distcp lots of small files from one HDFS cluster to another,
but even after specifying the number of mappers with -m 200, distcp starts
only 1 map task. The average file size is less than 100 MB.

Is there any way to start more mappers? I have already tried the -m option.

Thanks,
-
Ravi


Re: Building hadoop 0.21.0 from the source

2011-01-25 Thread crumby99

Did you ever figure this out?  I'm having the same exact issue.


Re: the performance of HDFS

2011-01-25 Thread Da Zheng

On 01/25/2011 05:49 PM, M. C. Srivas wrote:

On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng  wrote:


Hello,

I try to measure the performance of HDFS, but the writing rate is quite
low. When the replication factor is 1, the rate of writing to HDFS is about
60MB/s. When the replication factor is 3, the rate drops significantly to
about 15MB/s. Even though the actual rate of writing data to the disk is
about 45MB/s, it's still much lower than when replication factor is 1. The
link between two nodes in the cluster is 1Gbps. CPU is Dual-Core AMD
Opteron(tm) Processor 2212, so CPU isn't bottleneck either. I thought I
should be able to saturate the disk very easily. I wonder where the
bottleneck is. What is the throughput for writing on a Hadoop cluster when
the replication factor is 3?


The numbers above seem correct as per my observations.  If your data is
3-way replicated, the data-node writes about 3x the actual data written.
Conversely, your write-rate will be limited to 1/3 of  how fast the disk can
write, minus some overhead for replication.

The aggregate write-rate can get much higher if you use more drives, but a
single stream throughput is limited to the speed of one disk spindle.

You are right. I measured the performance of the hard drive, and it seems the
bottleneck is indeed the hard drive, which is a little too slow: the average
write rate is 50 MB/s.


Thanks,
Da


Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
This is a really slow drive or controller.

Consumer-grade 3.5-inch 2 TB drives can typically handle 100 MB/s. In the
absence of real information, I would suspect that your controller is more
likely to be deficient than your drive. If this is a laptop or something
similar, then I withdraw my thought.

On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng  wrote:

> The aggregate write-rate can get much higher if you use more drives, but a
>> single stream throughput is limited to the speed of one disk spindle.
>>
>>  You are right. I measure the performance of the hard drive. It seems the
> bottleneck is the hard drive, but the hard drive is a little too slow. The
> average writing rate is 50MB/s.


Re: the performance of HDFS

2011-01-25 Thread Da Zheng
No, each node in the cluster is a powerful server. I was told the nodes are
Dell PowerEdge SC1435s, but I cannot figure out the hard drive configuration;
Dell offers several possible hard drives for this model.

On 1/25/11 7:59 PM, Ted Dunning wrote:
> This is a really slow drive or controller.
> 
> Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s.  I would
> suspect in the absence of real information that your controller is more
> likely to be deficient than your drive.  If this is on a laptop or
> something, then I withdraw my thought.
> 
> On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng  wrote:
> 
>> The aggregate write-rate can get much higher if you use more drives, but a
>>> single stream throughput is limited to the speed of one disk spindle.
>>>
>>>  You are right. I measure the performance of the hard drive. It seems the
>> bottleneck is the hard drive, but the hard drive is a little too slow. The
>> average writing rate is 50MB/s.
> 



Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
Perhaps lshw would help you.

ubuntu:~$ sudo lshw
   ...
*-storage
 description: RAID bus controller
 product: SB700/SB800 SATA Controller [Non-RAID5 mode]
 vendor: ATI Technologies Inc
 physical id: 11
 bus info: pci@:00:11.0
 logical name: scsi0
 logical name: scsi1
 version: 00
 width: 32 bits
 clock: 66MHz
 capabilities: storage pm bus_master cap_list emulated
 configuration: driver=ahci latency=64
 resources: irq:22 ioport:b000(size=8) ioport:a000(size=4)
ioport:9000(size=8) ioport:8000(size=4) ioport:7000(size=16)
memory:fe7ffc00-fe7f
   *-disk
description: ATA Disk
product: ST3750528AS
vendor: Seagate
physical id: 0
bus info: scsi@0:0.0.0
   ...

On Tue, Jan 25, 2011 at 6:29 PM, Da Zheng  wrote:

> No, each node in the cluster is powerful server. I was told the nodes are
> Dell
> Poweredge SC1435, but I cannot figure out the configuration of hard drives.
> Dell
> provides several possible hard drives for this model.
>
> On 1/25/11 7:59 PM, Ted Dunning wrote:
> > This is a really slow drive or controller.
> >
> > Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s.  I would
> > suspect in the absence of real information that your controller is more
> > likely to be deficient than your drive.  If this is on a laptop or
> > something, then I withdraw my thought.
> >
> > On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng  wrote:
> >
> >> The aggregate write-rate can get much higher if you use more drives, but
> a
> >>> single stream throughput is limited to the speed of one disk spindle.
> >>>
> >>>  You are right. I measure the performance of the hard drive. It seems
> the
> >> bottleneck is the hard drive, but the hard drive is a little too slow.
> The
> >> average writing rate is 50MB/s.
> >
>
>


Re: the performance of HDFS

2011-01-25 Thread Da Zheng
Unfortunately, that command isn't available on the system, and I don't have
the privileges to install software :-(


On 1/25/11 9:37 PM, Ted Dunning wrote:
> Perhaps lshw would help you.
> 
> ubuntu:~$ sudo lshw
>...
> *-storage
>  description: RAID bus controller
>  product: SB700/SB800 SATA Controller [Non-RAID5 mode]
>  vendor: ATI Technologies Inc
>  physical id: 11
>  bus info: pci@:00:11.0
>  logical name: scsi0
>  logical name: scsi1
>  version: 00
>  width: 32 bits
>  clock: 66MHz
>  capabilities: storage pm bus_master cap_list emulated
>  configuration: driver=ahci latency=64
>  resources: irq:22 ioport:b000(size=8) ioport:a000(size=4)
> ioport:9000(size=8) ioport:8000(size=4) ioport:7000(size=16)
> memory:fe7ffc00-fe7f
>*-disk
> description: ATA Disk
> product: ST3750528AS
> vendor: Seagate
> physical id: 0
> bus info: scsi@0:0.0.0
>...
> 
> On Tue, Jan 25, 2011 at 6:29 PM, Da Zheng  wrote:
> 
>> No, each node in the cluster is powerful server. I was told the nodes are
>> Dell
>> Poweredge SC1435, but I cannot figure out the configuration of hard drives.
>> Dell
>> provides several possible hard drives for this model.
>>
>> On 1/25/11 7:59 PM, Ted Dunning wrote:
>>> This is a really slow drive or controller.
>>>
>>> Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s.  I would
>>> suspect in the absence of real information that your controller is more
>>> likely to be deficient than your drive.  If this is on a laptop or
>>> something, then I withdraw my thought.
>>>
>>> On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng  wrote:
>>>
 The aggregate write-rate can get much higher if you use more drives, but
>> a
> single stream throughput is limited to the speed of one disk spindle.
>
>  You are right. I measure the performance of the hard drive. It seems
>> the
 bottleneck is the hard drive, but the hard drive is a little too slow.
>> The
 average writing rate is 50MB/s.
>>>
>>
>>
> 



Re: Hadoop Binary File

2011-01-25 Thread F.Ozgur Catak
Can you give me a simple example or some source code for this?

On Tue, Jan 25, 2011 at 10:13 PM, Keith Wiley  wrote:

> I'm also doing binary image processing on Hadoop.  Where relevant, my Key
> and Value types are a WritableComparable class of my own creation which
> contains as members a BytesWritable object, obviously read from the file
> itself directly into memory.  I also keep the path in my class so I know
> where the file came from later.
>
> On Jan 25, 2011, at 11:46 , F.Ozgur Catak wrote:
>
> > Hi,
> >
> > I'm trying to develop an image processing application with hadoop. All
> image
> > files are in HDFS.  But I don't know how to read this files with
> binary/byte
> > stream. What is correct decleration of Mapper and
> Reducer
> > Class.
> >
> > Thanks
> >
> > Ozgur CATAK
>
>
>
> 
> Keith Wiley   kwi...@keithwiley.com
> www.keithwiley.com
>
> "The easy confidence with which I know another man's religion is folly
> teaches
> me to suspect that my own is also."
>  -- Mark Twain
>
> 
>
>
>
>