Re: configuration access
Arun C Murthy wrote:
> Can you re-check if the right paths (for your config files) are on the
> CLASSPATH?

That was it. Thanks.

--
Steve Sapovits
Invite Media - http://www.invitemedia.com
[EMAIL PROTECTED]
Re: Amazon S3 questions
Chris K Wensel wrote:
> do you have an underscore in your bucket name?

Yes I do. Here's a sample error message/stack trace. This is version 0.16.0:

localhost: Exception in thread "main" java.lang.IllegalArgumentException: port out of range:-1
localhost:     at java.net.InetSocketAddress.&lt;init&gt;(InetSocketAddress.java:118)
localhost:     at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:125)
localhost:     at org.apache.hadoop.dfs.SecondaryNameNode.&lt;init&gt;(SecondaryNameNode.java:94)
localhost:     at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:492)
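For what it's worth, the "port out of range:-1" failure is consistent with the underscore theory: a bucket name containing an underscore isn't a legal URI hostname, so Java's URI parsing can't pick a host/port out of an s3://bucket_name URL and the port comes back as -1. A minimal sketch of that hostname check in Python (the regex is my assumption of the hostname rule, not Hadoop code):

```python
import re

# S3 bucket names used in s3:// URLs must parse as a valid URI hostname:
# lowercase letters, digits, dots, and dashes only.  An underscore makes
# the Java URI parser fail to extract a host/port, so getPort() returns
# -1 and Hadoop dies with "port out of range:-1".
# (hostname-check sketch; regex is an assumption, not Hadoop's code)
HOSTNAME_SAFE = re.compile(r'^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$')

def bucket_name_ok(name):
    """Return True if `name` should be safe to use in an s3:// URL."""
    return bool(HOSTNAME_SAFE.match(name))
```

So a bucket like "my-logs" passes, while "my_logs" would trigger exactly the stack trace above.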
Re: Amazon S3 questions
> do you have an underscore in your bucket name?

Yes I do. Sorry, I was wrong -- no underscores in this one. Some of our buckets do have them, but this particular one uses only dashes.
Amazon S3 questions
I have some confusion over the use of Amazon S3 as storage. I was looking at fs.default.name as the name node -- a host and a port the client uses to ask the name node to perform DFS services. But for Amazon S3 you give it an S3 bucket URL, which is really just a direct pointer to the storage.

So it seems fs.default.name is really just a storage setting that happens to be a service (host/port) for HDFS. Even though S3 is also a service, it can't also be a name node. If that's true, where is the host/port of the name node configured, separate from fs.default.name? It would seem that no matter what the underlying storage, you'd need a name node service somewhere.

I'm probably missing something bigger here ...
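For reference, this is roughly what an S3-backed configuration looks like -- when S3 is the default filesystem there is no name node at all, and fs.default.name just names the filesystem to use. A hadoop-site.xml sketch (the bucket name and credential values are placeholders; property names are as I recall them from the 0.16-era S3 documentation):

```xml
<!-- hadoop-site.xml sketch: S3 as the default filesystem (no name node
     involved).  "mybucket" and the credential values are placeholders. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3://mybucket</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```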
Re: Amazon S3 questions
Bradford Stephens wrote:
> What sort of performance hit is there for using S3 vs. a local cluster?

It probably only makes sense speed-wise if you're running on EC2. S3 access from EC2 is a lot faster than accessing it from outside the Amazon cloud. If you run on EC2, S3 is essentially your persistent storage.
Re: long write operations and data recovery
Ted Dunning wrote:
> In our case, we looked at the problem and decided that Hadoop wasn't
> feasible for our real-time needs in any case. There were several issues --
> first of all, map-reduce itself didn't seem very plausible for real-time
> applications. That left hbase and hdfs as the capabilities offered by
> hadoop (for real-time stuff).

We'll be using map-reduce in batch mode, so we're okay there.

> The upshot is that we use hadoop extensively for batch operations where
> it really shines. The other nice effect is that we don't have to worry
> all that much about HA (at least not real-time HA) since we don't do
> real-time with hadoop.

What I'm struggling with is the write side of things. We'll have a huge amount of data to write that's essentially in a log format. It would seem that writing it outside of HDFS and then trying to batch import it would be a losing battle -- you would need the distributed nature of HDFS to do very large volume writes directly, and you wouldn't easily be able to take some other flat storage model and feed it in as a secondary step without having the HDFS side start to lag behind.

The realization is that the name node could go down, so we'll have to have a backup store that might be used during temporary outages, but most of the writes would be direct HDFS updates. The alternative would seem to be ending up with a set of distributed files without a unifying distributed file system (e.g., lots of Apache web logs on many, many individual boxes) and then having to come up with some way to funnel those back into HDFS.
Re: Cross-data centre DFS communication?
Owen O'Malley wrote:
> To copy between clusters, there is a tool called distcp. Look at
> bin/hadoop distcp. It runs a map/reduce job that copies a group of
> files. It can also be used to copy between versions of hadoop, if the
> source file system is hftp, which uses xml to read hdfs.

Can you further explain the hftp part of this? I'm not familiar with that. We have a similar need to go cross-data center. In an earlier post it was suggested that there was no map/reduce model for that, so this sounds more like what we're looking for.
Re: Cross-data centre DFS communication?
Owen O'Malley wrote:
> Sure, the info server on the name node of HDFS has a read-only
> interface that lists directories in xml and allows the client to read
> files over http. There is a FileSystem implementation that provides the
> client side interface to the xml/http access. To use it, you need a
> path with hftp as the protocol:
>
>   hadoop distcp hftp://namenode1:50070/foo/bar hdfs://namenode2:8020/foo

Very useful. Thanks.
Re: long write operations and data recovery
dhruba Borthakur wrote:
> The Namenode maintains a lease for every open file that is being
> written to. If the client that was writing to the file disappears, the
> Namenode will do lease recovery after expiry of the lease timeout (1
> hour). The lease recovery process (in most cases) will remove the last
> block from the file (it was not fully written because the client
> crashed before it could fill up the block) and close the file.

How does replication affect this? If there's at least one replicated client still running, I assume that takes care of it?
Re: long write operations and data recovery
> How does replication affect this? If there's at least one replicated
> client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point-of-failure question concerns name nodes. The docs say manual intervention is still required if a name node goes down. How is this typically managed in production environments? It would seem even a short name node outage in a data intensive environment would lead to data loss (no name node to give the data to).
Local testing and DHCP
When running in pseudo-distributed mode as outlined in the Quickstart, I see that the DFS is, at some level, identified by the IP address it was created under. I'm doing this on a laptop, and when I take it to another network the daemons come up okay but they can't find the DFS. It looks like that's because the IP is different from when the DFS was first created. Is there a way around this so I can run on the same box and see the same DFS regardless of what its IP is?
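One workaround I'm considering (an assumption on my part, not something from the docs) is to pin everything to the loopback address so a DHCP change doesn't matter, and to move the data out of /tmp while I'm at it. A hadoop-site.xml sketch (port number and directory are placeholders):

```xml
<!-- hadoop-site.xml sketch: pin the filesystem URI to the loopback
     address so the DFS identity survives DHCP address changes.
     The port and the tmp path are placeholders. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
  <property>
    <!-- keep DFS data out of /tmp so reboots don't wipe it -->
    <name>hadoop.tmp.dir</name>
    <value>/path/to/hadoop-tmp</value>
  </property>
</configuration>
```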
Re: Local testing and DHCP
Joydeep Sen Sarma wrote:
> a few of our nodes had (for inexplicable reasons) bound to
> localhost.localdomain for a while. definitely for map-reduce - this
> causes problems (not sure about hdfs). jobs were failing saying they
> could not find 'localhost.localdomain' (i think this was in the reduce
> copy phase trying to contact map outputs). i am not terribly sure of
> the details - but there are issues with this ..

I have a situation now, like ones I've seen before, where my config is exactly like it was yesterday but something about my network set-up is different, and I can't get the pseudo-distributed copy to come up at all on the box. It looks like the name node is out there, but the URL tries going to the dfshealth.jsp page and that fails with a 404 error. Very frustrating, as I'm sometimes spending hours trying to get a local test set-up to install the same way it did the day before.
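A quick sanity check for this class of problem (my own sketch, not something from the thread) is just to see what the local hostname resolves to before starting the daemons -- if it maps to localhost.localdomain or fails to resolve, the daemons can bind to the wrong interface or fail to find each other:

```python
import socket

# Print what the local hostname resolves to.  If the hostname maps to
# something unexpected (e.g. localhost.localdomain) or fails to resolve,
# Hadoop daemons can bind to the wrong interface or be unreachable.
def resolution_report():
    host = socket.gethostname()
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        addr = "UNRESOLVABLE"
    return host, addr, socket.gethostbyname("localhost")
```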
long write operations and data recovery
If I have a write operation that takes a while between opening and closing the file, what is the effect of the node doing the writing crashing in the middle?

For example, suppose I have large logs that I write to continually, rolling them every N minutes (say every hour for the sake of discussion). If I have the file open and am 90% done with my writes when things crash, what happens to the data I've written ... realizing that at some level, data isn't visible to the rest of the cluster until the file is closed.
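The rolling pattern described above can be sketched as follows. Since data only becomes visible (and survives lease recovery) once a file is closed, the roll interval bounds how much a crash can lose. This is a sketch under the assumption that a plain open() stands in for whatever HDFS client create() call you're using; names and the interval are illustrative:

```python
import time

class RollingLogWriter:
    """Sketch of the rolling-log pattern: bound the exposure of a crash
    by closing and re-opening the log file on a fixed interval.  The
    plain open() here stands in for an HDFS client's create() call (an
    assumption -- swap in your HDFS API of choice)."""

    def __init__(self, prefix, roll_seconds=3600, _open=open, _now=time.time):
        self.prefix = prefix
        self.roll_seconds = roll_seconds
        self._open = _open          # injectable for testing
        self._now = _now
        self._file = None
        self._opened_at = None

    def write(self, record):
        now = self._now()
        if self._file is None or now - self._opened_at >= self.roll_seconds:
            if self._file is not None:
                self._file.close()  # a closed file is visible/durable
            self._file = self._open("%s.%d.log" % (self.prefix, int(now)), "w")
            self._opened_at = now
        self._file.write(record)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

With an hourly roll, a crash mid-file costs at most the last partial hour of records, which matches the failure mode asked about above.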
Re: map/reduce across a WAN
Ted Dunning wrote:
> I think the best way to accomplish this sort of goal is to go ahead and
> run independent clusters and somehow add the ability to propagate files
> between clusters. Then the cross-cluster map-reduce can run in the
> cluster that has originals or replicas of all of the necessary files.

The problem is the amount of data. We're using HDFS because the volume will be huge. On the surface, replicating files across data centers would appear to take something that's sized to need map/reduce and force it down a serial sort of pipe, over a slow connection. Then we'd have to fan it back out to HDFS at each replicated site in order to crunch the data.

That's why I was thinking that if map/reduce pushes most of the work off to the individual nodes where the data resides, somehow map/reducing across the entire set of boxes in all data centers made more sense -- that would seem to reduce the volume of data flowing between data centers. Assume that the connections between data centers will be relatively slow, at least compared to local gigabit LAN speeds within a data center.
Re: Namenode fails to re-start after cluster shutdown
Raghu Angadi wrote:
> Please report such problems if you think it was because of HDFS, as
> opposed to some hardware or disk failures.

Will do. I suspect it's something else. I'm testing on a notebook in pseudo-distributed mode (per the quick start guide). My IP changes when I take that box between home and work, so that could be it -- even though I'm running everything on localhost, I've seen other issues if my hostname can't get properly resolved. Also, with everything in /tmp by default, shutdowns of that box may be removing files.
Python access to HDFS
Are there any existing HDFS access packages out there for Python? I've had some success using SWIG and the C HDFS code, as documented here:

  http://www.stat.purdue.edu/~sguha/code.html (halfway down the page)

but it's slow adding support for some of the more complex functions. If there's anything out there I missed, I'd like to hear about it.
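One zero-dependency fallback I've considered (my own sketch, not an existing package) is just shelling out to the hadoop fs command-line tool from Python. It's slower per call than libhdfs, but it covers every operation the CLI supports and needs no SWIG wrapping. This assumes the `hadoop` binary is on the PATH:

```python
import subprocess

def hdfs_command(action, *paths, hadoop_bin="hadoop"):
    """Build the argv for a `hadoop fs` shell action, e.g. -ls or -put."""
    return [hadoop_bin, "fs", action] + list(paths)

def hdfs_ls(path):
    """List an HDFS directory by invoking the CLI (requires a working
    hadoop installation and configuration on this box)."""
    result = subprocess.run(hdfs_command("-ls", path),
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Usage would be e.g. hdfs_ls("/logs") on a configured cluster, or hdfs_command("-put", "local.txt", "/logs/x.txt") passed to subprocess for uploads.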