Re: PNW Hadoop + Apache Cloud Stack Meetup, Wed. May 27th:
On Tue, May 26, 2009 at 6:42 PM, Bradford Stephens wrote: > Greetings, > This is a friendly reminder that the 1st meetup for the PNW Hadoop + Apache > Cloud Stack User Group is THIS WEDNESDAY at 6:45pm. We're very excited to > have everyone attend! > > University of Washington, Allen Center Room 303, at 6:45pm on Wednesday, May > 27, 2009. > I'm going to put together a map, and a wiki so we can collab. > > The Allen Center is located here: > http://www.washington.edu/home/maps/?CSE > > What I'm envisioning is a meetup for about 2 hours: we'll have two in-depth > talks of 15-20 minutes each, and then several "lightning talks" of 5 > minutes. We'll then have discussion and 'social time'. > Let me know if you're interested in speaking or attending. > > I'd like to focus on education, so every presentation *needs* to ask some > questions at the end. We can talk about these after the presentations, and > I'll record what we've learned in a wiki and share that with the rest of > us. > > Looking forward to meeting you all! sadly i'm on the other side of the pond (we did some talking at ApacheConEurope about the Apache Cloud Stack. so, i'm going to jump in here with a few ideas.) 1. after a while, mass cross postings start to become less efficient than a dedicated announcements list. since clouds cuts across project boundaries, this would need to be a cross-project mailing list - announceme...@clouds.apache.org, say. if this would be useful to the community, i'm willing to take a look at steering it through on the apache side. opinions? 2. we've made a start at http://cwiki.apache.org/labs/clouds.html trying to describe the projects, to document how they relate and to create space for discussions of architecture. as with any labs project, karma will be granted to any committer (just ask on the labs list). all very welcome. - robert
Re: MultipleOutputs / Hadoop 0.20.0?
Hi Dan, I do not think we have support for MultipleOutputs with the New API just yet. Jothi On 5/27/09 2:08 AM, "Daniel Young" wrote: > Greetings, list: > > I'm part of a project just getting started using Hadoop-core, and we figured > we may as well start with version 0.20.0 due to the refactoring of the Mapper > and Reducer classes. We'd also like to make use of MultipleOutputs, or > something with similar functionality. However, it depends upon the > now-deprecated JobConf class, as opposed to using Job as seems preferred by > 0.20.0. > > Does the ability to define and configure multiple outputs exist in a > 0.20.0-friendly form? Perhaps rolled into an OutputFormat somewhere? Using > the deprecated classes isn't too big of a deal; I'm just wondering if this > feature was updated in the 0.20.0 implementation. > > Thanks, > > Dan
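In the meantime, a minimal sketch of the old-API route on 0.20 (the deprecated JobConf plus org.apache.hadoop.mapred.lib.MultipleOutputs). The reducer class name and the "summary" named output below are made-up placeholders for illustration, not anything from the thread:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class MultiOutReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      private MultipleOutputs mos;

      // The JobConf carries the named-output definitions registered at job-setup time.
      public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
      }

      public void reduce(Text key, Iterator<LongWritable> values,
          OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));                                 // regular job output
        mos.getCollector("summary", reporter).collect(key, new LongWritable(sum));  // extra named output
      }

      public void close() throws IOException {
        mos.close();  // flush the named outputs
      }
    }

At job-setup time the named output is registered once on the (deprecated) JobConf, e.g. MultipleOutputs.addNamedOutput(conf, "summary", TextOutputFormat.class, Text.class, LongWritable.class);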
Re: Bisection bandwidth needed for a small Hadoop cluster
Stephen, I highly recommend you get switches that have a few 10GigE ports on them for interlinking. So say you have 40 servers/rack, then get top-of-rack switches with forty 1GigE ports and two 10GigE ports, this way you can have half the servers talking to the other rack without bottlenecks. The prices for such switches have been dropping significantly in recent months (you should be able to get them for sub $10K). Take a look at the switches from Arista Networks (www.aristanetworks.com) for competitive pricing. Cheers, -- amr Todd Lipcon wrote: Hi Stephen, The true answer depends on the types of jobs you're running. As a back of the envelope calculation I might figure something like this: 60 nodes total = 30 nodes per rack Each node might process about 100MB/sec of data In the case of a sort job where the intermediate data is the same size as the input data, that means each node needs to shuffle 100MB/sec of data In aggregate, each rack is then producing about 3GB/sec of data However, given even reducer spread across the racks, each rack will need to send 1.5GB/sec to reducers running on the other rack. Since the connection is full duplex, that means you need 1.5GB/sec of bisection bandwidth for this theoretical job. So that's 12Gbps. However, the above calculations are probably somewhat of an upper bound. A large number of jobs have significant data reduction during the map phase, either by some kind of filtering/selection going on in the Mapper itself, or by good usage of Combiners. Additionally, intermediate data compression can cut the intermediate data transfer by a significant factor. Lastly, although your disks can probably provide 100MB sustained throughput, it's rare to see a MR job which can sustain disk speed IO through the entire pipeline. So, I'd say my estimate is at least a factor of 2 too high. So, the simple answer is that 4-6Gbps is most likely just fine for most practical jobs. If you want to be extra safe, many inexpensive switches can operate in a "stacked" configuration where the bandwidth between them is essentially backplane speed. That should scale you to 96 nodes with plenty of headroom. -Todd On Tue, May 26, 2009 at 3:10 AM, stephen mulcahy wrote: Hi, Has anyone here investigated what level of bisection bandwidth is needed for a Hadoop cluster which spans more than one rack? I'm currently sizing and planning a new Hadoop cluster and I'm wondering what the performance implications will be if we end up with a cluster spread across two racks. I'd expect we'll have one 48-port gigabit switch in each 42u rack. If we end up with 60 systems spread across these two switches - how much bandwidth should I have between the racks? I'll have 6 gigabit ports available for links between racks - i.e. up to 6 Gbps. Would this be sufficient bisection bandwidth for Hadoop or should I be considering increased bandwidth between racks (maybe using fibre links between the switches or introducing another switch)? Thanks for any thoughts on this. -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
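For anyone who wants to plug in their own numbers, here is a tiny back-of-the-envelope sketch of Todd's arithmetic; all the figures are the same assumptions he states (sort-like job, 100MB/sec per node, even reducer spread), not measurements:

    public class BisectionEstimate {
      public static void main(String[] args) {
        int nodesPerRack = 30;              // 60 nodes across 2 racks
        double mbPerSecPerNode = 100.0;     // intermediate data produced per node, sort-like job
        double perRackMBs = nodesPerRack * mbPerSecPerNode;   // ~3000 MB/sec per rack
        double crossRackMBs = perRackMBs / 2.0;               // half the reducers sit on the other rack
        double crossRackGbps = crossRackMBs * 8.0 / 1000.0;   // ~12 Gbps worst case
        double derated = crossRackGbps / 2.0;                 // combiners/compression typically halve it
        System.out.printf("worst case %.0f Gbps, derated %.0f Gbps%n", crossRackGbps, derated);
      }
    }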
Persistent storage on EC2
I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep the master updated on how to map the internal IPs (which, as I understand, change) to a known set of host names, so it knows where the blocks are located each time I bring a cluster up? If so, is keeping a mapping up to date in /etc/hosts sufficient? Thanks
RE: Question regarding downtime model for DataNodes
Aaron - Thank you! Is the 10 minute interval configurable? Regards, Joseph Hammerman -Original Message- From: Aaron Kimball [mailto:aa...@cloudera.com] Sent: Tuesday, May 26, 2009 4:29 PM To: core-user@hadoop.apache.org Subject: Re: Question regarding downtime model for DataNodes A DataNode is regarded as "dead" after 10 minutes of inactivity. If a DataNode is down for < 10 minutes, and quickly returns, then nothing happens. After 10 minutes, its blocks are assumed to be lost forever. So the NameNode then begins scheduling those blocks for re-replication from the two surviving copies. There is no real-time bound on how long this process takes. It's a function of your available network bandwidth, server loads, disk speeds, amount of storage used, etc. Blocks which are down to a single surviving replica are prioritized over blocks which have two surviving replicas (in the event of a simultaneous or near-simultaneous double fault), since they are more vulnerable. If a DataNode does reappear, then its re-replication is cancelled, and over-provisioned blocks are scaled back to the target number of replicas. If that machine comes back with all blocks intact, then the node needs no "rebuilding." (In fact, some of the over-provisioned replicas that get removed might be from the original node, if they're available elsewhere too!) Don't forget that machines in Hadoop do not have strong notions of identity. If a particular machine is taken offline and its disks are wiped, the blocks which were there (which also existed in two other places) will be re-replicated elsewhere from the live copies. When that same machine is then brought back online, it has no incentive to "copy back" all the blocks that it used to have, as there will be three replicas elsewhere in the cluster. Blocks are never permanently bound to particular machines. If you add recommissioned or new nodes to a cluster, you should run the rebalancing script which will take a random sampling of blocks from heavily-laden nodes and move them onto emptier nodes in an attempt to spread the data as evenly as possible. - Aaron On Tue, May 26, 2009 at 3:08 PM, Joe Hammerman wrote: > Hello Hadoop Users list: > >We are running Hadoop version 0.18.2. My team lead has asked > me to investigate the answer to a particular question regarding Hadoop's > handling of offline DataNodes - specifically, we would like to know how long > a node can be offline before it is totally rebuilt when it has been re-added > to the cluster. >From what I've been able to determine from the documentation > it appears to me that the NameNode will simply begin scheduling block > replication on its remaining cluster members. If the offline node comes back > online, and it reports all its blocks as being uncorrupted, then the > NameNode just cleans up the "extra" blocks. >In other words, there is no explicit handling based on the > length of the outage - the behavior of the cluster will depend entirely on > the outage duration. > >Anyone care to shed some light on this? > >Thanks! > Regards, >Joseph Hammerman >
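Re the "is it configurable" question: a small sketch of where, as far as I recall from the 0.18.x FSNamesystem source, the ~10 minute figure comes from. The property names and defaults below are my recollection rather than anything stated in this thread, so verify them against your own tree before relying on them:

    import org.apache.hadoop.conf.Configuration;

    public class DeadNodeInterval {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // DataNodes heartbeat every dfs.heartbeat.interval seconds (default 3);
        // the NameNode rechecks liveness every heartbeat.recheck.interval ms (default 5 minutes).
        long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000L;
        long recheckInterval = conf.getInt("heartbeat.recheck.interval", 5 * 60 * 1000);
        // Expiry = 2 * recheck + 10 * heartbeat, i.e. roughly 10.5 minutes with the defaults.
        long expireInterval = 2 * recheckInterval + 10 * heartbeatInterval;
        System.out.println("DataNode treated as dead after ~" + expireInterval / 60000.0 + " minutes");
      }
    }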
Re: Question regarding downtime model for DataNodes
A DataNode is regarded as "dead" after 10 minutes of inactivity. If a DataNode is down for < 10 minutes, and quickly returns, then nothing happens. After 10 minutes, its blocks are assumed to be lost forever. So the NameNode then begins scheduling those blocks for re-replication from the two surviving copies. There is no real-time bound on how long this process takes. It's a function of your available network bandwidth, server loads, disk speeds, amount of storage used, etc. Blocks which are down to a single surviving replica are prioritized over blocks which have two surviving replicas (in the event of a simultaneous or near-simultaneous double fault), since they are more vulnerable. If a DataNode does reappear, then its re-replication is cancelled, and over-provisioned blocks are scaled back to the target number of replicas. If that machine comes back with all blocks intact, then the node needs no "rebuilding." (In fact, some of the over-provisioned replicas that get removed might be from the original node, if they're available elsewhere too!) Don't forget that machines in Hadoop do not have strong notions of identity. If a particular machine is taken offline and its disks are wiped, the blocks which were there (which also existed in two other places) will be re-replicated elsewhere from the live copies. When that same machine is then brought back online, it has no incentive to "copy back" all the blocks that it used to have, as there will be three replicas elsewhere in the cluster. Blocks are never permanently bound to particular machines. If you add recommissioned or new nodes to a cluster, you should run the rebalancing script which will take a random sampling of blocks from heavily-laden nodes and move them onto emptier nodes in an attempt to spread the data as evenly as possible. - Aaron On Tue, May 26, 2009 at 3:08 PM, Joe Hammerman wrote: > Hello Hadoop Users list: > >We are running Hadoop version 0.18.2. My team lead has asked > me to investigate the answer to a particular question regarding Hadoop's > handling of offline DataNodes - specifically, we would like to know how long > a node can be offline before it is totally rebuilt when it has been re-added > to the cluster. >From what I've been able to determine from the documentation > it appears to me that the NameNode will simply begin scheduling block > replication on its remaining cluster members. If the offline node comes back > online, and it reports all its blocks as being uncorrupted, then the > NameNode just cleans up the "extra" blocks. >In other words, there is no explicit handling based on the > length of the outage - the behavior of the cluster will depend entirely on > the outage duration. > >Anyone care to shed some light on this? > >Thanks! > Regards, >Joseph Hammerman >
Re: Username in Hadoop cluster
A slightly longer answer: If you're starting daemons with bin/start-dfs.sh or start-all.sh, you'll notice that these defer to hadoop-daemons.sh to do the heavy lifting. This evaluates the string: cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@" and passes it to an underlying loop to execute on all the slaves via ssh. $bin and $HADOOP_HOME are thus macro-replaced on the server-side. The more problematic one here is the $bin one, which resolves to the absolute path of the cwd on the server that is starting Hadoop. You've got three basic options: 1) Install Hadoop in the exact same path on all nodes 2) Modify bin/hadoop-daemons.sh to do something more clever on your system by deferring evaluation of HADOOP_HOME and the bin directory (probably really hairy; you might have to escape the variable names more than once since there's another script named slaves.sh that this goes through) 3) Start the slaves "manually" on each node by logging in yourself, and doing a "cd $HADOOP_HOME && bin/hadoop-daemon.sh datanode start" As a shameless plug, Cloudera's distribution for Hadoop ( www.cloudera.com/hadoop) will also provide init.d scripts so that you can start Hadoop daemons via the 'service' command. By default, the RPM installation will also standardize on the "hadoop" username. But you can't install this without being root. - Aaron On Tue, May 26, 2009 at 12:30 PM, Alex Loddengaard wrote: > It looks to me like you didn't install Hadoop consistently across all > nodes. > > xxx.xx.xx.251: bash: > > /home/utdhadoop1/Hadoop/ > > hadoop-0.18.3/bin/hadoop-daemon.sh: No such file or > directory > > The above makes me suspect that xxx.xx.xx.251 has Hadoop installed > differently. Can you try and locate hadoop-daemon.sh on xxx.xx.xx.251 and > adjust its location properly? > > Alex > > On Mon, May 25, 2009 at 10:25 PM, Pankil Doshi > wrote: > > > Hello, > > > > I tried adding "usern...@hostname" for each entry in the slaves file. > > > > My slaves file has 2 data nodes. It looks like below > > > > localhost > > utdhado...@xxx.xx.xx.229 > > utdhad...@xxx.xx.xx.251 > > > > > > The error I get when I start dfs is as below: > > > > starting namenode, logging to > > > > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-namenode-opencirrus-992.hpl.hp.com.out > > xxx.xx.xx.229: starting datanode, logging to > > > > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-datanode-opencirrus-992.hpl.hp.com.out > > xxx.xx.xx.251: bash: line 0: cd: > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/..: No such file or directory > > xxx.xx.xx.251: bash: > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/hadoop-daemon.sh: No such file > or > > directory > > localhost: datanode running as process 25814. Stop it first. > > xxx.xx.xx.229: starting secondarynamenode, logging to > > > > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-secondarynamenode-opencirrus-992.hpl.hp.com.out > > localhost: secondarynamenode running as process 25959. Stop it first. > > > > > > > > Basically it looks for "/home/utdhadoop1/Hadoop/ > > hadoop-0.18.3/bin/hadoop-daemon.sh" > > but instead it should look for "/home/utdhadoop/Hadoop/" as > > xxx.xx.xx.251 has username as utdhadoop. > > > > Any inputs??
> > > > Thanks > > Pankil > > > > On Wed, May 20, 2009 at 6:30 PM, Todd Lipcon wrote: > > > > > On Wed, May 20, 2009 at 4:14 PM, Alex Loddengaard > > > wrote: > > > > > > > First of all, if you can get all machines to have the same user, that > > > would > > > > greatly simplify things. > > > > > > > > If, for whatever reason, you absolutely can't get the same user on > all > > > > machines, then you could do either of the following: > > > > > > > > 1) Change the *-all.sh scripts to read from a slaves file that has > two > > > > fields: a host and a user > > > > > > > > > To add to what Alex said, you should actually already be able to do > this > > > with the existing scripts by simply using the format "usern...@hostname > " > > > for > > > each entry in the slaves file. > > > > > > -Todd > > > > > >
Announcement: Cloudera Hadoop Training in Washington DC (June 22-23)
Hadoop Fans, just a quick note that we are hosting two days of Hadoop training in Washington DC area (Alexandria, VA) on June 22 and 23. We cover Hadoop, Hive, Pig and more with a focus on hands-on work. Please share this with your friends who might not be on the user list yet. Both are listed under "Live Sessions" at http://www.cloudera.com/hadoop-training Registration is discounted for the next few days, so if you are interested, feel free to take advantage of the savings ($200) by registering early. We also have room for one more private session later in the week. Please contact me directly if you are interested in a customized, private session for your organization. Christophe -- get hadoop: cloudera.com/hadoop online training: cloudera.com/hadoop-training blog: cloudera.com/blog twitter: twitter.com/cloudera
Re: Bisection bandwidth needed for a small Hadoop cluster
Hi Stephen, The true answer depends on the types of jobs you're running. As a back of the envelope calculation I might figure something like this: 60 nodes total = 30 nodes per rack Each node might process about 100MB/sec of data In the case of a sort job where the intermediate data is the same size as the input data, that means each node needs to shuffle 100MB/sec of data In aggregate, each rack is then producing about 3GB/sec of data However, given even reducer spread across the racks, each rack will need to send 1.5GB/sec to reducers running on the other rack. Since the connection is full duplex, that means you need 1.5GB/sec of bisection bandwidth for this theoretical job. So that's 12Gbps. However, the above calculations are probably somewhat of an upper bound. A large number of jobs have significant data reduction during the map phase, either by some kind of filtering/selection going on in the Mapper itself, or by good usage of Combiners. Additionally, intermediate data compression can cut the intermediate data transfer by a significant factor. Lastly, although your disks can probably provide 100MB sustained throughput, it's rare to see a MR job which can sustain disk speed IO through the entire pipeline. So, I'd say my estimate is at least a factor of 2 too high. So, the simple answer is that 4-6Gbps is most likely just fine for most practical jobs. If you want to be extra safe, many inexpensive switches can operate in a "stacked" configuration where the bandwidth between them is essentially backplane speed. That should scale you to 96 nodes with plenty of headroom. -Todd On Tue, May 26, 2009 at 3:10 AM, stephen mulcahy wrote: > Hi, > > Has anyone here investigated what level of bisection bandwidth is needed > for a Hadoop cluster which spans more than one rack? > > I'm currently sizing and planning a new Hadoop cluster and I'm wondering > what the performance implications will be if we end up with a cluster spread > across two racks. I'd expect we'll have one 48-port gigabit switch in each > 42u rack. If we end up with 60 systems spread across these two switches - > how much bandwidth should I have between the racks? > > I'll have 6 gigabit ports available for links between racks - i.e. up to 6 > Gbps. Would this be sufficient bisection bandwidth for Hadoop or should I be > considering increased bandwidth between racks (maybe using fibre links > between the switches or introducing another switch)? > > Thanks for any thoughts on this. > > -stephen > > -- > Stephen Mulcahy, DI2, Digital Enterprise Research Institute, > NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland > http://di2.deri.ie http://webstar.deri.ie http://sindice.com >
Question regarding downtime model for DataNodes
Hello Hadoop Users list: We are running Hadoop version 0.18.2. My team lead has asked me to investigate the answer to a particular question regarding Hadoop's handling of offline DataNodes - specifically, we would like to know how long a node can be offline before it is totally rebuilt when it has been re-added to the cluster. From what I've been able to determine from the documentation it appears to me that the NameNode will simply begin scheduling block replication on its remaining cluster members. If the offline node comes back online, and it reports all its blocks as being uncorrupted, then the NameNode just cleans up the "extra" blocks. In other words, there is no explicit handling based on the length of the outage - the behavior of the cluster will depend entirely on the outage duration. Anyone care to shed some light on this? Thanks! Regards, Joseph Hammerman
get mapper number in pipes
Hi, I am trying to figure out if it is possible to extract the number of the mapper running in the actual mapper of a pipes job. There are occasional times when I want to rerun a job but not have certain mappers emit data depending on what they did the previous time. I can pass in which mappers to skip in a config file but can't figure out how to find the number of the mapper currently running. Any help will be greatly appreciated. Thanks, -Brian
MultipleOutputs / Hadoop 0.20.0?
Greetings, list: I'm part of a project just getting started using Hadoop-core, and we figured we may as well start with version 0.20.0 due to the refactoring of the Mapper and Reducer classes. We'd also like to make use of MultipleOutputs, or something with similar functionality. However, it depends upon the now-deprecated JobConf class, as opposed to using Job as seems preferred by 0.20.0. Does the ability to define and configure multiple outputs exist in a 0.20.0-friendly form? Perhaps rolled into an OutputFormat somewhere? Using the deprecated classes isn't too big of a deal; I'm just wondering if this feature was updated in the 0.20.0 implementation. Thanks, Dan
Re: InputStream.open() efficiency
Stas Oskin wrote: Hi. Thanks for the answer. Would up to 5 minutes of handlers cause any issues? 5 min should not cause any issues.. And same about writing? writing is not affected by the couple of issues I mentioned. Writing over a long time should work as well as writing over a shorter time. Raghu. Regards. 2009/5/26 Raghu Angadi 'in.seek(); in.read()' is certainly better than, 'in = fs.open(); in.seek(); in.read()' The difference is exactly one open() call. So you would save an RPC to NameNode. There are a couple of issues that affect apps that keep the handlers open very long time (many hours to days).. but those will be fixed soon. Raghu. Stas Oskin wrote: Hi. I'm looking to find out how InputStream.open() + skip() compares to keeping a handle of InputStream() and just seeking to the position. Has anyone compared these approaches, and can advise on their speed? Regards.
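To make the comparison concrete, a minimal sketch of the two access patterns being discussed (the path, offsets and loop of ten reads are made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekVsReopen {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/test/test.bin");   // placeholder path
        byte[] buf = new byte[4096];

        // Pattern 1: open once, then seek()+read() per record -- one NameNode open() RPC total.
        FSDataInputStream in = fs.open(p);
        for (long offset = 0; offset < 10 * 4096; offset += 4096) {
          in.seek(offset);
          in.read(buf, 0, buf.length);   // return value ignored in this sketch
        }
        in.close();

        // Pattern 2: reopen for every read -- one extra NameNode open() RPC per iteration.
        for (long offset = 0; offset < 10 * 4096; offset += 4096) {
          FSDataInputStream in2 = fs.open(p);
          in2.seek(offset);
          in2.read(buf, 0, buf.length);
          in2.close();
        }
      }
    }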
Re: InputStream.open() efficiency
Hi. Thanks for the answer. Would up to 5 minutes of handlers cause any issues? And same about writing? Regards. 2009/5/26 Raghu Angadi > > 'in.seek(); in.read()' is certainly better than, > 'in = fs.open(); in.seek(); in.read()' > > The difference is exactly one open() call. So you would save an RPC to > NameNode. > > There are a couple of issues that affect apps that keep the handlers open > very long time (many hours to days).. but those will be fixed soon. > > Raghu. > > > Stas Oskin wrote: > >> Hi. >> >> I'm looking to find out how InputStream.open() + skip() compares to >> keeping a handle of InputStream() and just seeking to the position. >> >> Has anyone compared these approaches, and can advise on their speed? >> >> Regards. >> >> >
Re: Username in Hadoop cluster
It looks to me like you didn't install Hadoop consistently across all nodes. xxx.xx.xx.251: bash: > /home/utdhadoop1/Hadoop/ hadoop-0.18.3/bin/hadoop-daemon.sh: No such file or directory The above makes me suspect that xxx.xx.xx.251 has Hadoop installed differently. Can you try and locate hadoop-daemon.sh on xxx.xx.xx.251 and adjust its location properly? Alex On Mon, May 25, 2009 at 10:25 PM, Pankil Doshi wrote: > Hello, > > I tried adding "usern...@hostname" for each entry in the slaves file. > > My slaves file has 2 data nodes. It looks like below > > localhost > utdhado...@xxx.xx.xx.229 > utdhad...@xxx.xx.xx.251 > > > The error I get when I start dfs is as below: > > starting namenode, logging to > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-namenode-opencirrus-992.hpl.hp.com.out > xxx.xx.xx.229: starting datanode, logging to > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-datanode-opencirrus-992.hpl.hp.com.out > xxx.xx.xx.251: bash: line 0: cd: > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/..: No such file or directory > xxx.xx.xx.251: bash: > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/hadoop-daemon.sh: No such file or > directory > localhost: datanode running as process 25814. Stop it first. > xxx.xx.xx.229: starting secondarynamenode, logging to > > /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-secondarynamenode-opencirrus-992.hpl.hp.com.out > localhost: secondarynamenode running as process 25959. Stop it first. > > > > Basically it looks for "/home/utdhadoop1/Hadoop/ > hadoop-0.18.3/bin/hadoop-daemon.sh" > but instead it should look for "/home/utdhadoop/Hadoop/" as > xxx.xx.xx.251 has username as utdhadoop. > > Any inputs?? > > Thanks > Pankil > > On Wed, May 20, 2009 at 6:30 PM, Todd Lipcon wrote: > > > On Wed, May 20, 2009 at 4:14 PM, Alex Loddengaard > > wrote: > > > > > First of all, if you can get all machines to have the same user, that > > would > > > greatly simplify things. > > > > > > If, for whatever reason, you absolutely can't get the same user on all > > > machines, then you could do either of the following: > > > > > > 1) Change the *-all.sh scripts to read from a slaves file that has two > > > fields: a host and a user > > > > > > To add to what Alex said, you should actually already be able to do this > > with the existing scripts by simply using the format "usern...@hostname" > > for > > each entry in the slaves file. > > > > -Todd > > >
Re: InputStream.open() efficiency
'in.seek(); in.read()' is certainly better than, 'in = fs.open(); in.seek(); in.read()' The difference is exactly one open() call. So you would save an RPC to NameNode. There are a couple of issues that affect apps that keep the handlers open very long time (many hours to days).. but those will be fixed soon. Raghu. Stas Oskin wrote: Hi. I'm looking to find out how InputStream.open() + skip() compares to keeping a handle of InputStream() and just seeking to the position. Has anyone compared these approaches, and can advise on their speed? Regards.
Re: When directly writing to HDFS, the data is moved only on file close
Hi. You are probably referring to the following paragraph? After some back and forth over a set of slides presented by Sanjay on work being done by Hairong as part of HADOOP-5744, "Revising append", the room settled on API3 from the list of options below as the priority feature needed by HADOOP 0.21.0. Readers must be able to read up to the last writer 'successful' flush. Its not important that the file length is 'inexact'. If I understand correctly, this means the data actually gets written to the cluster - but it's not visible until the block is closed. Work is ongoing for version 0.21 to make the file visible on flush(). Am I correct up to here? Regards. 2009/5/26 Tom White > > This feature is not available yet, and is still under active >> discussion. (The current version of HDFS will make the previous block >> available to readers.) Michael Stack gave a good summary on the HBase >> dev list: >> >> >> http://mail-archives.apache.org/mod_mbox/hadoop-hbase-dev/200905.mbox/%3c7c962aed0905231601g533088ebj4a7a068505ba3...@mail.gmail.com%3e >> >> Tom >> >> On Tue, May 26, 2009 at 12:08 PM, Stas Oskin >> wrote: >> > Hi. >> > >> > I'm trying to continuously write data to HDFS via OutputStream(), and >> want >> > to be able to read it at the same time from another client. >> > >> > Problem is, that after the file is created on HDFS with size of 0, it >> stays >> > that way, and only fills up when I close the OutputStream(). >> > >> > Here is a simple code sample illustrating this issue: >> > >> > try { >> > >> >FSDataOutputStream out=fileSystem.create(new >> > Path("/test/test.bin")); // Here the file created with 0 size >> >for(int i=0;i<1000;i++) >> >{ >> >out.write(1); // Still stays 0 >> >out.flush(); // Even when I flush it out??? >> >} >> > >> >Thread.currentThread().sleep(1); >> >out.close(); //Only here the file is updated >> >} catch (Exception e) { >> >e.printStackTrace(); >> >} >> > >> > So, two questions here: >> > >> > 1) How is it possible to write the files directly to HDFS, and have them >> > update there immediately? >> > 2) Just for information, in this case, where the file content stays all >> the >> > time - on server local disk, in memory, etc...? >> > >> > Thanks in advance. >> > >> >
Re: ssh issues
On 5/26/09 3:40 AM, "Steve Loughran" wrote: > HDFS is as secure as NFS: you are trusted to be who you say you are. > Which means that you have to run it on a secured subnet - access > restricted to trusted hosts and/or one or two front-end servers or accept > that your dataset is readable and writeable by anyone on the network. > > There is user identification going in; it is currently at the level > where it will stop someone accidentally deleting the entire filesystem > if they lack the rights. Which has been known to happen. Actually, I'd argue that HDFS is worse than even rudimentary NFS implementations. Off the top of my head: a) There is no equivalent of squash root/force anonymous. Any host can assume privilege. b) There is no 'read only from these hosts'. If you can read blocks over Hadoop RPC, you can write as well (minus safe mode).
Re: Setting up another machine as secondary node
Hi Rakhi, This is because your name-node is trying to -importCheckpoint from a directory, which is locked by secondary name-node. The secondary node is also running in your case, right? You should use -importCheckpoint as the last resort, when name-node's directories are damaged. In regular case you start name-node with ./hadoop-daemon.sh start namenode Thanks, --Konstantin Rakhi Khatwani wrote: Hi, I followed the instructions suggested by you all. but i still come across this exception when i use the following command: ./hadoop-daemon.sh start namenode -importCheckpoint the exception is as follows: 2009-05-26 14:43:48,004 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = germapp/192.168.0.1 STARTUP_MSG: args = [-importCheckpoint] STARTUP_MSG: version = 0.19.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 / 2009-05-26 14:43:48,147 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=4 2009-05-26 14:43:48,154 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: germapp/192.168.0.1:4 2009-05-26 14:43:48,160 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-05-26 14:43:48,166 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-26 14:43:48,316 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=ithurs,ithurs 2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2009-05-26 14:43:48,343 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-26 14:43:48,347 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /tmp/hadoop-ithurs/dfs/name is not formatted. 2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ... 2009-05-26 14:43:48,457 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked. 2009-05-26 14:43:48,460 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked. 
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:273) at org.apache.hadoop.hdfs.server.namenode.FSImage.doImportCheckpoint(FSImage.java:504) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:344) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:290) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868) 2009-05-26 14:43:48,464 INFO org.apache.hadoop.ipc.Server: Stopping server on 4 2009-05-26 14:43:48,466 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked. at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:273) at org.apache.hadoop.hdfs.server.namenode.FSImage.doImportCheckpoint(FSImage.java:504) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:344) at org.apache.hadoop.hdfs.server.namenode.FSDir
PNW Hadoop + Apache Cloud Stack Meetup, Wed. May 27th:
Greetings, This is a friendly reminder that the 1st meetup for the PNW Hadoop + Apache Cloud Stack User Group is THIS WEDNESDAY at 6:45pm. We're very excited to have everyone attend! University of Washington, Allen Center Room 303, at 6:45pm on Wednesday, May 27, 2009. I'm going to put together a map, and a wiki so we can collab. The Allen Center is located here: http://www.washington.edu/home/maps/?CSE What I'm envisioning is a meetup for about 2 hours: we'll have two in-depth talks of 15-20 minutes each, and then several "lightning talks" of 5 minutes. We'll then have discussion and 'social time'. Let me know if you're interested in speaking or attending. I'd like to focus on education, so every presentation *needs* to ask some questions at the end. We can talk about these after the presentations, and I'll record what we've learned in a wiki and share that with the rest of us. Looking forward to meeting you all! Cheers, Bradford Stephens
Re: Is there any performance issue with Jrockit JVM for Hadoop
Yep. I have tried the option mapred.job.reuse.jvm.num.tasks with -1. There is no big difference. Furthermore, even using this option with -1, there are still several task jvms besides the tasktracker jvm. On Mon, May 18, 2009 at 9:34 PM, Steve Loughran wrote: > Tom White wrote: > >> On Mon, May 18, 2009 at 11:44 AM, Steve Loughran >> wrote: >> >>> Grace wrote: >>> To follow up this question, I have also asked help on Jrockit forum. They kindly offered some useful and detailed suggestions according to the JRA results. After updating the option list, the performance did become better to some extent. But it is still not comparable with the Sun JVM. Maybe, it is due to the use case with short duration and different implementation in JVM layer between Sun and Jrockit. I would like to be back to use Sun JVM currently. Thanks all for your time and help. what about flipping the switch that says "run tasks in the TT's own >>> JVM?". >>> That should handle startup costs, and reduce the memory footprint >>> >>> >> The property mapred.job.reuse.jvm.num.tasks allows you to set how many >> tasks the JVM may be reused for (within a job), but it always runs in >> a separate JVM to the tasktracker. (BTW >> https://issues.apache.org/jira/browse/HADOOP-3675 has some discussion >> about running tasks in the tasktracker's JVM). >> >> Tom >> > > Tom, > that's why you are writing a book on Hadoop and I'm not ...you know the > answers and I have some vague misunderstandings, > > -steve > (returning to the svn book) >
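For reference, a minimal sketch of setting the reuse property programmatically; -1 means unlimited reuse within one job, and as noted above it only reuses the child JVM between tasks, it never merges tasks into the TaskTracker's JVM:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Reuse the child JVM for an unlimited number of tasks of this job.
        // Equivalent command-line form: -Dmapred.job.reuse.jvm.num.tasks=-1
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        System.out.println(conf.getInt("mapred.job.reuse.jvm.num.tasks", 1));
      }
    }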
Re: HDFS data for HBase and other projects
On Tue, May 26, 2009 at 2:28 AM, Wasim Bari wrote: > Hi, > If we already have data stored in HDFS, which of the following sub-projects > can use this data for further processing/operations: > > 1. Pig > Yes 2. HBase > No, needs to be imported into HBase 3. ZooKeeper > No, not used for processing data 4. Hive > Yes, if it's in a well defined format > 5. Any other Hadoop related project > Maybe? Sorry for the vague answer, but you'll have to ask a specific question to get a specific answer. I recommend reading the web pages for the projects and checking out the "Getting Started" guides. -Todd
Re: When directly writing to HDFS, the data is moved only on file close
Hi. Does it mean there is no way to access the data being written to HDFS, while it's being written? Where is it stored then during the writing - on the cluster or on local disks? Thanks. 2009/5/26 Tom White > This feature is not available yet, and is still under active > discussion. (The current version of HDFS will make the previous block > available to readers.) Michael Stack gave a good summary on the HBase > dev list: > > > http://mail-archives.apache.org/mod_mbox/hadoop-hbase-dev/200905.mbox/%3c7c962aed0905231601g533088ebj4a7a068505ba3...@mail.gmail.com%3e > > Tom > > On Tue, May 26, 2009 at 12:08 PM, Stas Oskin wrote: > > Hi. > > > > I'm trying to continuously write data to HDFS via OutputStream(), and > want > > to be able to read it at the same time from another client. > > > > Problem is, that after the file is created on HDFS with size of 0, it > stays > > that way, and only fills up when I close the OutputStream(). > > > > Here is a simple code sample illustrating this issue: > > > > try { > > > >FSDataOutputStream out=fileSystem.create(new > > Path("/test/test.bin")); // Here the file created with 0 size > >for(int i=0;i<1000;i++) > >{ > >out.write(1); // Still stays 0 > >out.flush(); // Even when I flush it out??? > >} > > > >Thread.currentThread().sleep(1); > >out.close(); //Only here the file is updated > >} catch (Exception e) { > >e.printStackTrace(); > >} > > > > So, two questions here: > > > > 1) How is it possible to write the files directly to HDFS, and have them > > update there immediately? > > 2) Just for information, in this case, where the file content stays all > the > > time - on server local disk, in memory, etc...? > > > > Thanks in advance. > >
Re: When directly writing to HDFS, the data is moved only on file close
This feature is not available yet, and is still under active discussion. (The current version of HDFS will make the previous block available to readers.) Michael Stack gave a good summary on the HBase dev list: http://mail-archives.apache.org/mod_mbox/hadoop-hbase-dev/200905.mbox/%3c7c962aed0905231601g533088ebj4a7a068505ba3...@mail.gmail.com%3e Tom On Tue, May 26, 2009 at 12:08 PM, Stas Oskin wrote: > Hi. > > I'm trying to continuously write data to HDFS via OutputStream(), and want > to be able to read it at the same time from another client. > > Problem is, that after the file is created on HDFS with size of 0, it stays > that way, and only fills up when I close the OutputStream(). > > Here is a simple code sample illustrating this issue: > > try { > > FSDataOutputStream out=fileSystem.create(new > Path("/test/test.bin")); // Here the file created with 0 size > for(int i=0;i<1000;i++) > { > out.write(1); // Still stays 0 > out.flush(); // Even when I flush it out??? > } > > Thread.currentThread().sleep(1); > out.close(); //Only here the file is updated > } catch (Exception e) { > e.printStackTrace(); > } > > So, two questions here: > > 1) How is it possible to write the files directly to HDFS, and have them > update there immediately? > 2) Just for information, in this case, where the file content stays all the > time - on server local disk, in memory, etc...? > > Thanks in advance. >
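To observe the behaviour Tom describes from the reading side, here is a small sketch that polls the NameNode's view of the file length while another client writes it (the path and the loop count are arbitrary placeholders); with current releases the visible length only grows as blocks are completed or the file is closed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LengthProbe {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/test/test.bin");   // same placeholder path as the writer example
        // Poll the reported length while another process is writing the file.
        for (int i = 0; i < 30; i++) {
          System.out.println("visible length = " + fs.getFileStatus(p).getLen());
          Thread.sleep(1000);
        }
      }
    }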
InputStream.open() efficiency
Hi. I'm looking to find out how InputStream.open() + skip() compares to keeping a handle of InputStream() and just seeking to the position. Has anyone compared these approaches, and can advise on their speed? Regards.
When directly writing to HDFS, the data is moved only on file close
Hi. I'm trying to continuously write data to HDFS via OutputStream(), and want to be able to read it at the same time from another client. Problem is, that after the file is created on HDFS with size of 0, it stays that way, and only fills up when I close the OutputStream(). Here is a simple code sample illustrating this issue: try { FSDataOutputStream out=fileSystem.create(new Path("/test/test.bin")); // Here the file created with 0 size for(int i=0;i<1000;i++) { out.write(1); // Still stays 0 out.flush(); // Even when I flush it out??? } Thread.currentThread().sleep(1); out.close(); //Only here the file is updated } catch (Exception e) { e.printStackTrace(); } So, two questions here: 1) How is it possible to write the files directly to HDFS, and have them update there immediately? 2) Just for information, in this case, where the file content stays all the time - on server local disk, in memory, etc...? Thanks in advance.
Re: ssh issues
hmar...@umbc.edu wrote: Steve, Security through obscurity is always a good practice from a development standpoint and one of the reasons why tricking you out is an easy task. :) My most recent presentation on HDFS clusters is now online, notice how it doesn't gloss over the security: http://www.slideshare.net/steve_l/hdfs-issues Please, keep hiding relevant details from people in order to keep everyone smiling. HDFS is as secure as NFS: you are trusted to be who you say you are. Which means that you have to run it on a secured subnet - access restricted to trusted hosts and/or one or two front-end servers or accept that your dataset is readable and writeable by anyone on the network. There is user identification going in; it is currently at the level where it will stop someone accidentally deleting the entire filesystem if they lack the rights. Which has been known to happen. If the team looking after the cluster demand separate SSH keys/login for every machine then not only are they making their operations costs high, once you have got the HDFS cluster and MR engine live, it's moot. You can push out work to the JobTracker, which then runs it on the machines, under whatever userid the TaskTrackers are running on. Now, 0.20+ will run it under the identity of the user who claimed to be submitting the job, but without that, your MR Jobs get the access rights to the filesystem of the user that is running the TT, but it's fairly straightforward to create a modified hadoop client JAR that doesn't call whoami to get the userid, and instead spoofs to be anyone. Which means that even if you lock down the filesystem -no out of datacentre access-, if I can run my java code as MR jobs in your cluster, I can have unrestricted access to the filesystem by way of the task tracker server. But Hal, if you are running Ant for your build I'm running my code on your machines anyway, so you had better be glad that I'm not malicious. -Steve
Bisection bandwidth needed for a small Hadoop cluster
Hi, Has anyone here investigated what level of bisection bandwidth is needed for a Hadoop cluster which spans more than one rack? I'm currently sizing and planning a new Hadoop cluster and I'm wondering what the performance implications will be if we end up with a cluster spread across two racks. I'd expect we'll have one 48-port gigabit switch in each 42u rack. If we end up with 60 systems spread across these two switches - how much bandwidth should I have between the racks? I'll have 6 gigabit ports available for links between racks - i.e. up to 6 Gbps. Would this be sufficient bisection bandwidth for Hadoop or should I be considering increased bandwidth between racks (maybe using fibre links between the switches or introducing another switch)? Thanks for any thoughts on this. -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
HDFS data for HBase and other projects
Hi, If we already have data stored in HDFS, which of the following sub-projects can use this data for further processing/operations: 1. Pig 2. HBase 3. ZooKeeper 4. Hive 5. Any other Hadoop related project Thanks, Wasim
Re: Setting up another machine as secondary node
Hi, I followed the instructions suggested by you all. but i still come across this exception when i use the following command: ./hadoop-daemon.sh start namenode -importCheckpoint the exception is as follows: 2009-05-26 14:43:48,004 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = germapp/192.168.0.1 STARTUP_MSG: args = [-importCheckpoint] STARTUP_MSG: version = 0.19.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 / 2009-05-26 14:43:48,147 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=4 2009-05-26 14:43:48,154 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: germapp/192.168.0.1:4 2009-05-26 14:43:48,160 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-05-26 14:43:48,166 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-26 14:43:48,316 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=ithurs,ithurs 2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2009-05-26 14:43:48,317 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2009-05-26 14:43:48,343 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-26 14:43:48,347 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /tmp/hadoop-ithurs/dfs/name is not formatted. 2009-05-26 14:43:48,455 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ... 2009-05-26 14:43:48,457 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked. 2009-05-26 14:43:48,460 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked. 
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:273) at org.apache.hadoop.hdfs.server.namenode.FSImage.doImportCheckpoint(FSImage.java:504) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:344) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:290) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868) 2009-05-26 14:43:48,464 INFO org.apache.hadoop.ipc.Server: Stopping server on 4 2009-05-26 14:43:48,466 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: Cannot lock storage /tmp/hadoop-ithurs/dfs/namesecondary. The directory is already locked. at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510) at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:273) at org.apache.hadoop.hdfs.server.namenode.FSImage.doImportCheckpoint(FSImage.java:504) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:344) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:290) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163) at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208) a
Re: Specifying NameNode externally to hadoop-site.xml
Hi. Thanks for the tip. Regards. 2009/5/26 Aaron Kimball > Same way. > > Configuration conf = new Configuration(); > conf.set("fs.default.name", "hdfs://foo"); > FileSystem fs = FileSystem.get(conf); > > - Aaron > > On Mon, May 25, 2009 at 1:02 PM, Stas Oskin wrote: > > > Hi. > > > > And if I don't use jobs but only DFS for now? > > > > Regards. > > > > 2009/5/25 jason hadoop > > > > > conf.set("fs.default.name", "hdfs://host:port"); > > > where conf is the JobConf object of your job, before you submit it. > > > > > > > > > On Mon, May 25, 2009 at 10:16 AM, Stas Oskin > > wrote: > > > > > > > Hi. > > > > > > > > Thanks for the tip, but is it possible to set this in dynamic way via > > > code? > > > > > > > > Thanks. > > > > > > > > 2009/5/25 jason hadoop > > > > > > > > > if you launch your jobs via bin/hadoop jar jar_file [main class] > > > > [options] > > > > > > > > > > you can simply specify -fs hdfs://host:port before the jar_file > > > > > > > > > > On Sun, May 24, 2009 at 3:02 PM, Stas Oskin > > > > wrote: > > > > > > > > > > > Hi. > > > > > > > > > > > > I'm looking to move the Hadoop NameNode URL outside the > > > hadoop-site.xml > > > > > > file, so I could set it at the run-time. > > > > > > > > > > > > Any idea how to do it? > > > > > > > > > > > > Or perhaps there is another configuration that can be applied to > > the > > > > > > FileSystem object? > > > > > > > > > > > > Regards. > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Alpha Chapters of my book on Hadoop are available > > > > > http://www.apress.com/book/view/9781430219422 > > > > > www.prohadoopbook.com a community for Hadoop Professionals > > > > > > > > > > > > > > > > > > > > > -- > > > Alpha Chapters of my book on Hadoop are available > > > http://www.apress.com/book/view/9781430219422 > > > www.prohadoopbook.com a community for Hadoop Professionals > > > > > >
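Putting the pieces from this thread together, a small self-contained sketch that takes the NameNode URI as a run-time argument instead of reading it from hadoop-site.xml; the example host/port in the comment and the root-directory listing are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DynamicNameNode {
      public static void main(String[] args) throws Exception {
        // Pass the NameNode URI at run time, e.g. "hdfs://namenode.example.com:9000"
        Configuration conf = new Configuration();
        conf.set("fs.default.name", args[0]);
        FileSystem fs = FileSystem.get(conf);
        // Quick sanity check: list the root of the DFS we just pointed at.
        for (FileStatus stat : fs.listStatus(new Path("/"))) {
          System.out.println(stat.getPath());
        }
      }
    }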