Re: Fastest way to transfer files

2012-12-28 Thread Joep Rottinghuis
Not sure why you are implying a contradiction when you say: "... distcp is 
useful _but_ you want to do 'it' in java..."

First of all distcp _is_ written in Java.
You can call distcp or any other MR job from Java just fine.
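
For illustration, here is a minimal sketch of a chunked copy between two clusters using the plain FileSystem API (the namenode URIs and paths are made up); this also answers the chunk-by-chunk question. DistCp itself implements the Tool interface, so it can likewise be launched from Java with the same arguments you would pass on the command line.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CrossClusterCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical namenode URIs - replace with your two clusters.
    FileSystem srcFs = FileSystem.get(URI.create("hdfs://cluster-a:8020"), conf);
    FileSystem dstFs = FileSystem.get(URI.create("hdfs://cluster-b:8020"), conf);

    FSDataInputStream in = srcFs.open(new Path("/data/in/file1"));
    FSDataOutputStream out = dstFs.create(new Path("/data/copied/file1"));
    // Streams the file in buffered chunks; the final argument closes both streams.
    IOUtils.copyBytes(in, out, 64 * 1024, true);
  }
}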

Cheers,

Joep

Sent from my iPhone

On Dec 28, 2012, at 12:01 PM, burakkk  wrote:

> Hi,
> I have two different HDFS clusters. I need to transfer files between these 
> environments. What's the fastest way to transfer files in that situation? 
> 
> I've researched this and found the distcp command. It's useful, but I want to 
> do it in Java, so is there any way to do this?
> 
> Is there any way to transfer files chunk by chunk from one HDFS cluster to 
> another, or is there any way to implement the process in chunks without 
> copying the whole file at once?
> 
> Thanks
> Best Regards...
> 
> -- 
> BURAK ISIKLI | http://burakisikli.wordpress.com
> 


Re: how to start hadoop 1.0.4 backup node?

2012-12-28 Thread 周梦想
Yes, it no longer appears in the 1.1.x docs, but the 1.0.4 documentation is still there.

2012/12/29 Harsh J 

> Hi,
>
> I'd already addressed this via
> https://issues.apache.org/jira/browse/HADOOP-7297 and it isn't present
> anymore in 1.1.x+ docs.
>
> On Sat, Dec 29, 2012 at 11:42 AM, 周梦想  wrote:
> >
> > OK, reported the bug as HDFS-4348.
> >
> > thanks.
> > Andy
> >
> > 2012/12/29 Suresh Srinivas 
> >>
> >> This is a documentation bug. Backup node is not available in 1.x
> release.
> >> It is available in 0.23 and 2.x releases. Please create a bug to point
> 1.x
> >> documents to the right set of docs.
> >>
> >> Sent from a mobile device
> >>
> >> On Dec 28, 2012, at 7:13 PM, 周梦想  wrote:
> >>
> >> http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Backup+Node
> >>
> >> The document says:
> >> The Backup node is configured in the same manner as the Checkpoint node.
> >> It is started with bin/hdfs namenode -checkpoint
> >>
> >> but in Hadoop 1.0.4 there is no hdfs script:
> >> [zhouhh@Hadoop48 hadoop-1.0.4]$ ls bin
> >> hadoop            hadoop-daemons.sh  start-all.sh       start-jobhistoryserver.sh  stop-balancer.sh          stop-mapred.sh
> >> hadoop-config.sh  rcc                start-balancer.sh  start-mapred.sh            stop-dfs.sh               task-controller
> >> hadoop-daemon.sh  slaves.sh          start-dfs.sh       stop-all.sh                stop-jobhistoryserver.sh
> >>
> >>
> >> [zhouhh@Hadoop48 hadoop-1.0.4]$ find . -name hdfs
> >> ./webapps/hdfs
> >> ./src/webapps/hdfs
> >> ./src/test/org/apache/hadoop/hdfs
> >> ./src/test/system/aop/org/apache/hadoop/hdfs
> >> ./src/test/system/java/org/apache/hadoop/hdfs
> >> ./src/hdfs
> >> ./src/hdfs/org/apache/hadoop/hdfs
> >>
> >>
> >> thanks!
> >> Andy
> >
> >
>
>
>
> --
> Harsh J
>


Re: how to start hadoop 1.0.4 backup node?

2012-12-28 Thread Harsh J
Hi,

I'd already addressed this via
https://issues.apache.org/jira/browse/HADOOP-7297 and it isn't present
anymore in 1.1.x+ docs.

On Sat, Dec 29, 2012 at 11:42 AM, 周梦想  wrote:
>
> OK, reported the bug as HDFS-4348.
>
> thanks.
> Andy
>
> 2012/12/29 Suresh Srinivas 
>>
>> This is a documentation bug. Backup node is not available in 1.x release.
>> It is available in 0.23 and 2.x releases. Please create a bug to point 1.x
>> documents to the right set of docs.
>>
>> Sent from a mobile device
>>
>> On Dec 28, 2012, at 7:13 PM, 周梦想  wrote:
>>
>> http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Backup+Node
>>
>> The document says:
>> The Backup node is configured in the same manner as the Checkpoint node.
>> It is started with bin/hdfs namenode -checkpoint
>>
>> but in Hadoop 1.0.4 there is no hdfs script:
>> [zhouhh@Hadoop48 hadoop-1.0.4]$ ls bin
>> hadoop            hadoop-daemons.sh  start-all.sh       start-jobhistoryserver.sh  stop-balancer.sh          stop-mapred.sh
>> hadoop-config.sh  rcc                start-balancer.sh  start-mapred.sh            stop-dfs.sh               task-controller
>> hadoop-daemon.sh  slaves.sh          start-dfs.sh       stop-all.sh                stop-jobhistoryserver.sh
>>
>>
>> [zhouhh@Hadoop48 hadoop-1.0.4]$ find . -name hdfs
>> ./webapps/hdfs
>> ./src/webapps/hdfs
>> ./src/test/org/apache/hadoop/hdfs
>> ./src/test/system/aop/org/apache/hadoop/hdfs
>> ./src/test/system/java/org/apache/hadoop/hdfs
>> ./src/hdfs
>> ./src/hdfs/org/apache/hadoop/hdfs
>>
>>
>> thanks!
>> Andy
>
>



-- 
Harsh J


Re: how to start hadoop 1.0.4 backup node?

2012-12-28 Thread 周梦想
OK, reported the bug as HDFS-4348.

thanks.
Andy

2012/12/29 Suresh Srinivas 

> This is a documentation bug. Backup node is not available in 1.x release.
> It is available in 0.23 and 2.x releases. Please create a bug to point 1.x
> documents to the right set of docs.
>
> Sent from a mobile device
>
> On Dec 28, 2012, at 7:13 PM, 周梦想  wrote:
>
> http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Backup+Node
>
> The document says:
> The Backup node is configured in the same manner as the Checkpoint node.
> It is started with bin/hdfs namenode -checkpoint
>
> but in Hadoop 1.0.4 there is no hdfs script:
> [zhouhh@Hadoop48 hadoop-1.0.4]$ ls bin
> hadoop            hadoop-daemons.sh  start-all.sh       start-jobhistoryserver.sh  stop-balancer.sh          stop-mapred.sh
> hadoop-config.sh  rcc                start-balancer.sh  start-mapred.sh            stop-dfs.sh               task-controller
> hadoop-daemon.sh  slaves.sh          start-dfs.sh       stop-all.sh                stop-jobhistoryserver.sh
>
>
> [zhouhh@Hadoop48 hadoop-1.0.4]$ find . -name hdfs
> ./webapps/hdfs
> ./src/webapps/hdfs
> ./src/test/org/apache/hadoop/hdfs
> ./src/test/system/aop/org/apache/hadoop/hdfs
> ./src/test/system/java/org/apache/hadoop/hdfs
> ./src/hdfs
> ./src/hdfs/org/apache/hadoop/hdfs
>
>
> thanks!
> Andy
>
>


Re: how to start hadoop 1.0.4 backup node?

2012-12-28 Thread Suresh Srinivas
This is a documentation bug. Backup node is not available in 1.x release. It is 
available in 0.23 and 2.x releases. Please create a bug to point 1.x documents 
to the right set of docs. 

Sent from a mobile device

On Dec 28, 2012, at 7:13 PM, 周梦想  wrote:

> http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Backup+Node
> 
> The document says:
> The Backup node is configured in the same manner as the Checkpoint node. It
> is started with bin/hdfs namenode -checkpoint
> 
> but in Hadoop 1.0.4 there is no hdfs script:
> [zhouhh@Hadoop48 hadoop-1.0.4]$ ls bin
> hadoop            hadoop-daemons.sh  start-all.sh       start-jobhistoryserver.sh  stop-balancer.sh          stop-mapred.sh
> hadoop-config.sh  rcc                start-balancer.sh  start-mapred.sh            stop-dfs.sh               task-controller
> hadoop-daemon.sh  slaves.sh          start-dfs.sh       stop-all.sh                stop-jobhistoryserver.sh
> 
> 
> [zhouhh@Hadoop48 hadoop-1.0.4]$ find . -name hdfs
> ./webapps/hdfs
> ./src/webapps/hdfs
> ./src/test/org/apache/hadoop/hdfs
> ./src/test/system/aop/org/apache/hadoop/hdfs
> ./src/test/system/java/org/apache/hadoop/hdfs
> ./src/hdfs
> ./src/hdfs/org/apache/hadoop/hdfs
> 
> 
> thanks!
> Andy


how to start hadoop 1.0.4 backup node?

2012-12-28 Thread 周梦想
http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Backup+Node

The document says:
The Backup node is configured in the same manner as the Checkpoint node. It
is started with bin/hdfs namenode -checkpoint

but in Hadoop 1.0.4 there is no hdfs script:
[zhouhh@Hadoop48 hadoop-1.0.4]$ ls bin
hadoop            hadoop-daemons.sh  start-all.sh       start-jobhistoryserver.sh  stop-balancer.sh          stop-mapred.sh
hadoop-config.sh  rcc                start-balancer.sh  start-mapred.sh            stop-dfs.sh               task-controller
hadoop-daemon.sh  slaves.sh          start-dfs.sh       stop-all.sh                stop-jobhistoryserver.sh


[zhouhh@Hadoop48 hadoop-1.0.4]$ find . -name hdfs
./webapps/hdfs
./src/webapps/hdfs
./src/test/org/apache/hadoop/hdfs
./src/test/system/aop/org/apache/hadoop/hdfs
./src/test/system/java/org/apache/hadoop/hdfs
./src/hdfs
./src/hdfs/org/apache/hadoop/hdfs


thanks!
Andy


Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-28 Thread Jay Vyas
The only way to implement B is by doing A (that I know of, at least).
 Also, the word "each" is clearly a dead giveaway that B is the wrong
answer, since it implies special logic for communicating with individual
mappers/reducers.

On Fri, Dec 28, 2012 at 8:20 PM, Edward Capriolo wrote:

> Yes. Another big data, data scientist, no ops, devops, cloud computing
> specialist is born. Thank goodness we have multiple choice tests to
> identify the best coders and administrators.
>



-- 
Jay Vyas
http://jayunit100.blogspot.com


Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-28 Thread Edward Capriolo
Yes. Another big data, data scientist, no ops, devops, cloud computing
specialist is born. Thank goodness we have multiple choice tests to
identify the best coders and administrators.

On Friday, December 28, 2012, Michel Segel 
wrote:
> Sounds like someone is cheating on a test...
>
> Sent from a remote device. Please excuse any typos...
> Mike Segel
> On Dec 28, 2012, at 3:10 PM, Ted Dunning  wrote:
>
> Answer B sounds pathologically bad to me.
> A or C are the only viable options.
> Neither B nor D work.  B fails because it would be extremely hard to get
the right records to the right components and because it pollutes data
input with configuration data.  D fails because statics don't work in
parallel programs.
>
> On Fri, Dec 28, 2012 at 12:17 AM, Kshiva Kps  wrote:
>
> Which one is correct?
>
> What is the preferred way to pass a small number of configuration
parameters to a mapper or reducer?
>
>
>
>
>
> A.  As key-value pairs in the jobconf object.
>
>
>
> B.  As a custom input key-value pair passed to each mapper or reducer.
>
>


Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-28 Thread Michel Segel
Sounds like someone is cheating on a test...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 28, 2012, at 3:10 PM, Ted Dunning  wrote:

> Answer B sounds pathologically bad to me.
> 
> A or C are the only viable options.
> 
> Neither B nor D work.  B fails because it would be extremely hard to get the 
> right records to the right components and because it pollutes data input with 
> configuration data.  D fails because statics don't work in parallel programs.
> 
> 
> On Fri, Dec 28, 2012 at 12:17 AM, Kshiva Kps  wrote:
>> 
>> Which one is correct?
>> 
>> What is the preferred way to pass a small number of configuration parameters 
>> to a mapper or reducer?
>>  
>>  
>> A.  As key-value pairs in the jobconf object.
>>  
>> B.  As a custom input key-value pair passed to each mapper or reducer.
>>  
>> C.  Using a plain text file via the Distributedcache, which each mapper or 
>> reducer reads.
>>  
>> D.  Through a static variable in the MapReduce driver class (i.e., the class 
>> that submits the MapReduce job).
>>  
>> Answer: B
> 


Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-28 Thread Ted Dunning
Answer B sounds pathologically bad to me.

A or C are the only viable options.

Neither B nor D work.  B fails because it would be extremely hard to get
the right records to the right components and because it pollutes data
input with configuration data.  D fails because statics don't work in
parallel programs.
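
For anyone who wants a concrete picture, here is a minimal sketch of option A (the property name "my.threshold" and the job details are made up): the driver puts the value into the job configuration, and every task reads it back in setup().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfParamExample {

  public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private int threshold;

    @Override
    protected void setup(Context context) {
      // Read the value the driver placed in the job configuration.
      threshold = context.getConfiguration().getInt("my.threshold", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Use the parameter; here we only emit lines longer than the threshold.
      if (value.getLength() > threshold) {
        context.write(value, new LongWritable(1));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("my.threshold", 42);  // set once in the driver, visible to every task
    Job job = new Job(conf, "conf-param-example");
    job.setJarByClass(ConfParamExample.class);
    job.setMapperClass(MyMapper.class);
    // input/output formats and paths omitted for brevity
  }
}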


On Fri, Dec 28, 2012 at 12:17 AM, Kshiva Kps  wrote:

>
> Which one is correct?
>
>
> What is the preferred way to pass a small number of configuration
> parameters to a mapper or reducer?
>
>
>
>
>
> A.  As key-value pairs in the jobconf object.
>
> B.  As a custom input key-value pair passed to each mapper or reducer.
>
> C.  Using a plain text file via the DistributedCache, which each mapper
> or reducer reads.
>
> D.  Through a static variable in the MapReduce driver class (i.e., the
> class that submits the MapReduce job).
>
> Answer: B
>
>
>


Re: Hadoop harddrive space usage

2012-12-28 Thread Robert Molina
Hi Jean,
Hadoop will not factor in the number of disks or directories, but rather mainly
the allocated free space.  Hadoop will do its best to spread the data evenly
across the nodes.  For instance, let's say you had 3 datanodes (replication
factor 1), each with 10GB allocated, but one of the nodes split its 10GB across
two directories.  Now if we try to store a file that takes up 3 blocks, Hadoop
will still just place 1 block on each node.

Hope that helps.

Regards,
Robert

On Fri, Dec 28, 2012 at 9:12 AM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> Hi,
>
> Quick question regarding hard drive space usage.
>
> Hadoop will distribute the data evenly on the cluster. So all the
> nodes are going to receive almost the same quantity of data to store.
>
> Now, if on one node I have 2 directories configured, is hadoop going
> to assign twice the quantity on this node? Or is each directory going
> to receive half the load?
>
> Thanks,
>
> JM
>


Hadoop harddrive space usage

2012-12-28 Thread Jean-Marc Spaggiari
Hi,

Quick question regarding hard drive space usage.

Hadoop will distribute the data evenly on the cluster. So all the
nodes are going to receive almost the same quantity of data to store.

Now, if on one node I have 2 directories configured, is hadoop going
to assign twice the quantity on this node? Or is each directory going
to receive half the load?

Thanks,

JM


Re: question about ZKFC daemon

2012-12-28 Thread ESGLinux
Thank you for your answer, Craig.

I'm planning my cluster and for now I'm not sure how many machines I need ;-)

If I have doubts I'll follow what Cloudera says, and if I have a problem I know
where to ask for explanations :-)

ESGLinux



2012/12/28 Craig Munro 

> OK, I have reliable storage on my datanodes so not an issue for me.  If
> that's what Cloudera recommends then I'm sure it's fine.
> On Dec 28, 2012 10:38 AM, "ESGLinux"  wrote:
>
>> Hi Craig,
>>
>> I'm a bit confused; I have read this from Cloudera:
>> https://ccp.cloudera.com/display/CDH4DOC/Hardware+Configuration+for+Quorum-based+Storage
>>
>> The JournalNode daemon is relatively lightweight, so these daemons can
>> reasonably be collocated on machines with other Hadoop daemons, for example
>> NameNodes, the JobTracker, or the YARN ResourceManager.
>> Cloudera recommends that you deploy the JournalNode daemons on the
>> "master" host or hosts (NameNode, Standby NameNode, JobTracker, etc.) so
>> the JournalNodes' local directories can use the reliable local storage on
>> those machines.
>> There must be at least three JournalNode daemons, since edit log
>> modifications must be written to a majority of JournalNodes
>> As you can read, they recommend putting the JournalNode daemons with the
>> NameNodes, but you say the opposite. ??¿?¿??
>>
>>
>> Thanks for your answer,
>>
>> ESGLinux,
>>
>>
>>
>>
>> 2012/12/28 Craig Munro 
>>
>>> You need the following:
>>>
>>> - active namenode + zkfc
>>> - standby namenode + zkfc
>>> - pool of journal nodes (odd number, 3 or more)
>>> - pool of zookeeper nodes (odd number, 3 or more)
>>>
>>> As the journal nodes hold the namesystem transactions they should not be
>>> co-located with the namenodes in case of failure.  I distribute the journal
>>> and zookeeper nodes across the hosts running datanodes or as Harsh says you
>>> could co-locate them on dedicated hosts.
>>>
>>> ZKFC does not monitor the JobTracker.
>>>
>>> Regards,
>>> Craig
>>> On Dec 28, 2012 9:25 AM, "ESGLinux"  wrote:
>>>
 Hi,

 well, if I have understood you, I can configure my NN HA cluster this
 way:

 - Active NameNode + 1 ZKFC daemon + Journal Node
 - Standby NameNode + 1 ZKFC daemon + Journal Node
 - JobTracker node + 1 ZKFC daemon + Journal Node,

 Is this right?

 Thanks in advance,

 ESGLinux,

 2012/12/27 Harsh J 

> Hi,
>
> There are two different things here: Automatic Failover and Quorum
> Journal Manager. The former, used via a ZooKeeper Failover Controller,
> is to manage failovers automatically (based on health checks of NNs).
> The latter, used via a set of Journal Nodes, is a medium of shared
> storage for namesystem transactions that helps enable HA.
>
> In a typical deployment, you want 3 or more (odd) JournalNodes for
> reliable HA, preferably on nodes of their own if possible (like you
> would for typical ZooKeepers, and you may co-locate with those as
> well) and one ZKFC for each NameNode (connected to the same ZK
> quorum).
>
> On Thu, Dec 27, 2012 at 5:33 PM, ESGLinux  wrote:
> > Hi all,
> >
> > I have a doubt about how to deploy the Zookeeper in a NN HA  cluster,
> >
> > As far as I know, I need at least three nodes to run three ZooKeeper
> > FailOver Controller (ZKFC). I plan to put these 3 daemons this way:
> >
> > - Active NameNode + 1 ZKFC daemon
> > - Standby NameNode + 1 ZKFC daemon
> > - JobTracker node + 1 ZKFC daemon, (is this right?)
> >
> > so the quorum is formed with these three nodes. The nodes that run a
> > namenode make sense because the ZKFC monitors them, but what does the
> > third daemon do?
> >
> > as I read from this url:
> >
> > https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-AutomaticFailoverConfiguration
> >
> > these daemons are only related to NameNodes (health monitoring - the ZKFC
> > pings its local NameNode on a periodic basis with a health-check command),
> > so what does the third ZKFC do? I used the jobtracker node, but I could
> > use another node without any daemon on it...
> >
> > Thanks in advance,
> >
> > ESGLInux,
> >
> >
> >
>
>
>
> --
> Harsh J
>


>>


Re: question about ZKFC daemon

2012-12-28 Thread Craig Munro
OK, I have reliable storage on my datanodes so not an issue for me.  If
that's what Cloudera recommends then I'm sure it's fine.
On Dec 28, 2012 10:38 AM, "ESGLinux"  wrote:

> Hi Craig,
>
> I'm a bit confused; I have read this from Cloudera:
> https://ccp.cloudera.com/display/CDH4DOC/Hardware+Configuration+for+Quorum-based+Storage
>
> The JournalNode daemon is relatively lightweight, so these daemons can
> reasonably be collocated on machines with other Hadoop daemons, for example
> NameNodes, the JobTracker, or the YARN ResourceManager.
> Cloudera recommends that you deploy the JournalNode daemons on the
> "master" host or hosts (NameNode, Standby NameNode, JobTracker, etc.) so
> the JournalNodes' local directories can use the reliable local storage on
> those machines.
> There must be at least three JournalNode daemons, since edit log
> modifications must be written to a majority of JournalNodes
> As you can read, they recommend putting the JournalNode daemons with the
> NameNodes, but you say the opposite. ??¿?¿??
>
>
> Thanks for your answer,
>
> ESGLinux,
>
>
>
>
> 2012/12/28 Craig Munro 
>
>> You need the following:
>>
>> - active namenode + zkfc
>> - standby namenode + zkfc
>> - pool of journal nodes (odd number, 3 or more)
>> - pool of zookeeper nodes (odd number, 3 or more)
>>
>> As the journal nodes hold the namesystem transactions they should not be
>> co-located with the namenodes in case of failure.  I distribute the journal
>> and zookeeper nodes across the hosts running datanodes or as Harsh says you
>> could co-locate them on dedicated hosts.
>>
>> ZKFC does not monitor the JobTracker.
>>
>> Regards,
>> Craig
>> On Dec 28, 2012 9:25 AM, "ESGLinux"  wrote:
>>
>>> Hi,
>>>
>>> well, if I have understood you, I can configure my NN HA cluster this way:
>>>
>>> - Active NameNode + 1 ZKFC daemon + Journal Node
>>> - Standby NameNode + 1 ZKFC daemon + Journal Node
>>> - JobTracker node + 1 ZKFC daemon + Journal Node,
>>>
>>> Is this right?
>>>
>>> Thanks in advance,
>>>
>>> ESGLinux,
>>>
>>> 2012/12/27 Harsh J 
>>>
 Hi,

 There are two different things here: Automatic Failover and Quorum
 Journal Manager. The former, used via a ZooKeeper Failover Controller,
 is to manage failovers automatically (based on health checks of NNs).
 The latter, used via a set of Journal Nodes, is a medium of shared
 storage for namesystem transactions that helps enable HA.

 In a typical deployment, you want 3 or more (odd) JournalNodes for
 reliable HA, preferably on nodes of their own if possible (like you
 would for typical ZooKeepers, and you may co-locate with those as
 well) and one ZKFC for each NameNode (connected to the same ZK
 quorum).

 On Thu, Dec 27, 2012 at 5:33 PM, ESGLinux  wrote:
 > Hi all,
 >
 > I have a doubt about how to deploy the Zookeeper in a NN HA  cluster,
 >
 > As far as I know, I need at least three nodes to run three ZooKeeper
 > FailOver Controller (ZKFC). I plan to put these 3 daemons this way:
 >
 > - Active NameNode + 1 ZKFC daemon
 > - Standby NameNode + 1 ZKFC daemon
 > - JobTracker node + 1 ZKFC daemon, (is this right?)
 >
 > so the quorum is formed with these three nodes. The nodes that run a
 > namenode make sense because the ZKFC monitors them, but what does the
 > third daemon do?
 >
 > as I read from this url:
 >
 > https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-AutomaticFailoverConfiguration
 >
 > these daemons are only related to NameNodes (health monitoring - the ZKFC
 > pings its local NameNode on a periodic basis with a health-check command),
 > so what does the third ZKFC do? I used the jobtracker node, but I could
 > use another node without any daemon on it...
 >
 > Thanks in advance,
 >
 > ESGLInux,
 >
 >
 >



 --
 Harsh J

>>>
>>>
>


Re: question about ZKFC daemon

2012-12-28 Thread ESGLinux
Hi Craig,

I'm a bit confused; I have read this from Cloudera:
https://ccp.cloudera.com/display/CDH4DOC/Hardware+Configuration+for+Quorum-based+Storage

The JournalNode daemon is relatively lightweight, so these daemons can
reasonably be collocated on machines with other Hadoop daemons, for example
NameNodes, the JobTracker, or the YARN ResourceManager.
Cloudera recommends that you deploy the JournalNode daemons on the "master"
host or hosts (NameNode, Standby NameNode, JobTracker, etc.) so the
JournalNodes' local directories can use the reliable local storage on those
machines.
There must be at least three JournalNode daemons, since edit log
modifications must be written to a majority of JournalNodes
As you can read, they recommend putting the JournalNode daemons with the
NameNodes, but you say the opposite. ??¿?¿??


Thanks for your answer,

ESGLinux,




2012/12/28 Craig Munro 

> You need the following:
>
> - active namenode + zkfc
> - standby namenode + zkfc
> - pool of journal nodes (odd number, 3 or more)
> - pool of zookeeper nodes (odd number, 3 or more)
>
> As the journal nodes hold the namesystem transactions they should not be
> co-located with the namenodes in case of failure.  I distribute the journal
> and zookeeper nodes across the hosts running datanodes or as Harsh says you
> could co-locate them on dedicated hosts.
>
> ZKFC does not monitor the JobTracker.
>
> Regards,
> Craig
> On Dec 28, 2012 9:25 AM, "ESGLinux"  wrote:
>
>> Hi,
>>
>> well, if I have understood you, I can configure my NN HA cluster this way:
>>
>> - Active NameNode + 1 ZKFC daemon + Journal Node
>> - Standby NameNode + 1 ZKFC daemon + Journal Node
>> - JobTracker node + 1 ZKFC daemon + Journal Node,
>>
>> Is this right?
>>
>> Thanks in advance,
>>
>> ESGLinux,
>>
>> 2012/12/27 Harsh J 
>>
>>> Hi,
>>>
>>> There are two different things here: Automatic Failover and Quorum
>>> Journal Manager. The former, used via a ZooKeeper Failover Controller,
>>> is to manage failovers automatically (based on health checks of NNs).
>>> The latter, used via a set of Journal Nodes, is a medium of shared
>>> storage for namesystem transactions that helps enable HA.
>>>
>>> In a typical deployment, you want 3 or more (odd) JournalNodes for
>>> reliable HA, preferably on nodes of their own if possible (like you
>>> would for typical ZooKeepers, and you may co-locate with those as
>>> well) and one ZKFC for each NameNode (connected to the same ZK
>>> quorum).
>>>
>>> On Thu, Dec 27, 2012 at 5:33 PM, ESGLinux  wrote:
>>> > Hi all,
>>> >
>>> > I have a doubt about how to deploy the Zookeeper in a NN HA  cluster,
>>> >
>>> > As far as I know, I need at least three nodes to run three ZooKeeper
>>> > FailOver Controller (ZKFC). I plan to put these 3 daemons this way:
>>> >
>>> > - Active NameNode + 1 ZKFC daemon
>>> > - Standby NameNode + 1 ZKFC daemon
>>> > - JobTracker node + 1 ZKFC daemon, (is this right?)
>>> >
>>> > so the quorum is formed with these three nodes. The nodes that run a
>>> > namenode make sense because the ZKFC monitors them, but what does the
>>> > third daemon do?
>>> >
>>> > as I read from this url:
>>> >
>>> > https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-AutomaticFailoverConfiguration
>>> >
>>> > these daemons are only related to NameNodes (health monitoring - the ZKFC
>>> > pings its local NameNode on a periodic basis with a health-check command),
>>> > so what does the third ZKFC do? I used the jobtracker node, but I could
>>> > use another node without any daemon on it...
>>> >
>>> > Thanks in advance,
>>> >
>>> > ESGLInux,
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>


Re: question about ZKFC daemon

2012-12-28 Thread Craig Munro
You need the following:

- active namenode + zkfc
- standby namenode + zkfc
- pool of journal nodes (odd number, 3 or more)
- pool of zookeeper nodes (odd number, 3 or more)

As the journal nodes hold the namesystem transactions they should not be
co-located with the namenodes in case of failure.  I distribute the journal
and zookeeper nodes across the hosts running datanodes or as Harsh says you
could co-locate them on dedicated hosts.

ZKFC does not monitor the JobTracker.

Regards,
Craig
On Dec 28, 2012 9:25 AM, "ESGLinux"  wrote:

> Hi,
>
> well, if I have understood you, I can configure my NN HA cluster this way:
>
> - Active NameNode + 1 ZKFC daemon + Journal Node
> - Standby NameNode + 1 ZKFC daemon + Journal Node
> - JobTracker node + 1 ZKFC daemon + Journal Node,
>
> Is this right?
>
> Thanks in advance,
>
> ESGLinux,
>
> 2012/12/27 Harsh J 
>
>> Hi,
>>
>> There are two different things here: Automatic Failover and Quorum
>> Journal Manager. The former, used via a ZooKeeper Failover Controller,
>> is to manage failovers automatically (based on health checks of NNs).
>> The latter, used via a set of Journal Nodes, is a medium of shared
>> storage for namesystem transactions that helps enable HA.
>>
>> In a typical deployment, you want 3 or more (odd) JournalNodes for
>> reliable HA, preferably on nodes of their own if possible (like you
>> would for typical ZooKeepers, and you may co-locate with those as
>> well) and one ZKFC for each NameNode (connected to the same ZK
>> quorum).
>>
>> On Thu, Dec 27, 2012 at 5:33 PM, ESGLinux  wrote:
>> > Hi all,
>> >
>> > I have a doubt about how to deploy the Zookeeper in a NN HA  cluster,
>> >
>> > As far as I know, I need at least three nodes to run three ZooKeeper
>> > FailOver Controller (ZKFC). I plan to put these 3 daemons this way:
>> >
>> > - Active NameNode + 1 ZKFC daemon
>> > - Standby NameNode + 1 ZKFC daemon
>> > - JobTracker node + 1 ZKFC daemon, (is this right?)
>> >
>> > so the quorum is formed with these three nodes. The nodes that run a
>> > namenode make sense because the ZKFC monitors them, but what does the
>> > third daemon do?
>> >
>> > as I read from this url:
>> >
>> > https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-AutomaticFailoverConfiguration
>> >
>> > these daemons are only related to NameNodes (health monitoring - the ZKFC
>> > pings its local NameNode on a periodic basis with a health-check command),
>> > so what does the third ZKFC do? I used the jobtracker node, but I could
>> > use another node without any daemon on it...
>> >
>> > Thanks in advance,
>> >
>> > ESGLInux,
>> >
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>


Re: distributed cache

2012-12-28 Thread Lin Ma
Thanks Harsh,

(1) "Thankfully, due to block sizes the latter isn't a problem for large
files on a proper DN, as the blocks are spread over the disks and across
the nodes." -- What do you mean DN?

(2) So, you mean concurrent read for small block will not degrade
performance, but concurrent read for large block will degrade performance
compared to single thread read for large block? Please feel free to correct
me if I am wrong. The results are interesting. Appreciate if you could
elaborate a bit more details why.

regards,
Lin
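
As a concrete reference for the flow under discussion, here is a minimal sketch of DistributedCache usage with the Hadoop 1.x classes (the file path and job details are hypothetical): the client registers a file that already lives on HDFS, and each TaskTracker localizes a copy before running tasks.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheLookupJob {

  public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException {
      // Each task opens the copy its TaskTracker localized onto local disk,
      // rather than re-reading the original file from HDFS for every record.
      Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      // ... open localFiles[0] with java.io and load it into memory ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Register an HDFS file as a cache file; TaskTrackers download ("localize")
    // it once per node before tasks start.
    DistributedCache.addCacheFile(new URI("/user/lin/lookup.txt"), conf);
    Job job = new Job(conf, "cache-lookup");
    job.setJarByClass(CacheLookupJob.class);
    job.setMapperClass(LookupMapper.class);
    // input/output formats and paths omitted for brevity
  }
}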

On Wed, Dec 26, 2012 at 8:19 PM, Harsh J  wrote:

> Hi,
>
> Sorry for having been ambiguous. For (1) I meant a large block (if the
> block size is large). For (2) I meant multiple, concurrent threads.
>
> On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma  wrote:
> > Thanks Harsh,
> >
> > For long read, you mean read a large continuous part of a file, other
> than a
> > small chunk of a file?
> > "gradually decreasing performance for long reads" -- you mean parallel
> > multiple threads long read degrade performance? Or single thread
> exclusive
> > long read degrade performance?
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 7:48 PM, Harsh J  wrote:
> >>
> >> Hi Lin,
> >>
> >> It is comparable (and is also logically similar) to reading a file
> >> multiple times in parallel in a local filesystem - not too much of a
> >> performance hit for small reads (by virtue of OS caches, and quick
> >> completion per read, as is usually the case for distributed cache
> >> files), and gradually decreasing performance for long reads (due to
> >> frequent disk physical movement)? Thankfully, due to block sizes the
> >> latter isn't a problem for large files on a proper DN, as the blocks
> >> are spread over the disks and across the nodes.
> >>
> >> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma  wrote:
> >> > Thanks Harsh, multiple concurrent read is generally faster or?
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J  wrote:
> >> >>
> >> >> There is no limitation in HDFS that limits reads of a block to a
> >> >> single client at a time (no reason to do so) - so downloads can be as
> >> >> concurrent as possible.
> >> >>
> >> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma  wrote:
> >> >> > Thanks Harsh,
> >> >> >
> >> >> > Supposing DistributedCache is uploaded by client, for each replica,
> >> >> > in
> >> >> > Hadoop design, it could only serve one download session (download
> >> >> > from a
> >> >> > mapper or a reducer which requires the DistributedCache) at a time
> >> >> > until
> >> >> > DistributedCache file download is completed, or it could serve
> >> >> > multiple
> >> >> > concurrent parallel download session (download from multiple
> mappers
> >> >> > or
> >> >> > reducers which requires the DistributedCache).
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J 
> wrote:
> >> >> >>
> >> >> >> Hi Lin,
> >> >> >>
> >> >> >> DistributedCache files are stored onto the HDFS by the client
> first.
> >> >> >> The TaskTrackers download and localize it. Therefore, as with any
> >> >> >> other file on HDFS, "downloads" can be efficiently parallel with
> >> >> >> higher replicas.
> >> >> >>
> >> >> >> The point of having higher replication for these files is also
> tied
> >> >> >> to
> >> >> >> the concept of racks in a cluster - you would want more replicas
> >> >> >> spread across racks such that on task bootup the downloads happen
> >> >> >> with
> >> >> >> rack locality.
> >> >> >>
> >> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma  wrote:
> >> >> >> > Hi Kai,
> >> >> >> >
> >> >> >> > Smart answer! :-)
> >> >> >> >
> >> >> >> > The assumption you have is one distributed cache replica could
> >> >> >> > only
> >> >> >> > serve
> >> >> >> > one download session for tasktracker node (this is why you get
> >> >> >> > concurrency
> >> >> >> > n/r). The question is, why one distributed cache replica cannot
> >> >> >> > serve
> >> >> >> > multiple concurrent download session? For example, supposing a
> >> >> >> > tasktracker
> >> >> >> > use elapsed time t to download a file from a specific
> distributed
> >> >> >> > cache
> >> >> >> > replica, it is possible for 2 tasktrackers to download from the
> >> >> >> > specific
> >> >> >> > distributed cache replica in parallel using elapsed time t as
> >> >> >> > well,
> >> >> >> > or
> >> >> >> > 1.5
> >> >> >> > t, which is faster than sequential download time 2t you
> mentioned
> >> >> >> > before?
> >> >> >> > "In total, r+n/r concurrent operations. If you optimize r
> >> >> >> > depending
> >> >> >> > on
> >> >> >> > n,
> >> >> >> > SRQT(n) is the optimal replication level." -- how do you get
> >> >> >> > SRQT(n)
> >> >> >> > for
> >> >> >> > minimize r+n/r? Appreciate if you could point me to more
> details.
> >> >> >> >
> >> >> >> > regards,
> >> >> >> > Lin
> >> >> >> >
> >> >> >> >
> >> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt  wrote:

Re: question about ZKFC daemon

2012-12-28 Thread ESGLinux
Hi,

Well, if I have understood you, I can configure my NN HA cluster this way:

- Active NameNode + 1 ZKFC daemon + Journal Node
- Standby NameNode + 1 ZKFC daemon + Journal Node
- JobTracker node + 1 ZKFC daemon + Journal Node,

Is this right?

Thanks in advance,

ESGLinux,

2012/12/27 Harsh J 

> Hi,
>
> There are two different things here: Automatic Failover and Quorum
> Journal Manager. The former, used via a ZooKeeper Failover Controller,
> is to manage failovers automatically (based on health checks of NNs).
> The latter, used via a set of Journal Nodes, is a medium of shared
> storage for namesystem transactions that helps enable HA.
>
> In a typical deployment, you want 3 or more (odd) JournalNodes for
> reliable HA, preferably on nodes of their own if possible (like you
> would for typical ZooKeepers, and you may co-locate with those as
> well) and one ZKFC for each NameNode (connected to the same ZK
> quorum).
>
> On Thu, Dec 27, 2012 at 5:33 PM, ESGLinux  wrote:
> > Hi all,
> >
> > I have a doubt about how to deploy the Zookeeper in a NN HA  cluster,
> >
> > As far as I know, I need at least three nodes to run three ZooKeeper
> > FailOver Controller (ZKFC). I plan to put these 3 daemons this way:
> >
> > - Active NameNode + 1 ZKFC daemon
> > - Standby NameNode + 1 ZKFC daemon
> > - JobTracker node + 1 ZKFC daemon, (is this right?)
> >
> > so the quorum is formed with these three nodes. The nodes that run a
> > namenode make sense because the ZKFC monitors them, but what does the
> > third daemon do?
> >
> > as I read from this url:
> >
> > https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-AutomaticFailoverConfiguration
> >
> > these daemons are only related to NameNodes (health monitoring - the ZKFC
> > pings its local NameNode on a periodic basis with a health-check command),
> > so what does the third ZKFC do? I used the jobtracker node, but I could use
> > another node without any daemon on it...
> >
> > Thanks in advance,
> >
> > ESGLInux,
> >
> >
> >
>
>
>
> --
> Harsh J
>