Re: configuration access

2008-03-04 Thread Steve Sapovits
Arun C Murthy wrote:

 Can you re-check if the right paths (for your config files) are on the
 CLASSPATH?

That was it.  Thanks.
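
For anyone hitting the same thing, a minimal check I've been using to verify
the conf directory really is on the CLASSPATH (just a sketch, assuming the
0.16-era API, where Configuration resolves hadoop-default.xml and
hadoop-site.xml as classpath resources):

    import org.apache.hadoop.conf.Configuration;

    public class ConfCheck {
        public static void main(String[] args) {
            // If the conf/ directory isn't on the CLASSPATH, this prints the
            // built-in default instead of the value from hadoop-site.xml.
            Configuration conf = new Configuration();
            System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        }
    }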

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: Amazon S3 questions

2008-03-02 Thread Steve Sapovits
Chris K Wensel wrote:

 do you have an underscore in your bucket name?

Yes I do.

Here's a sample error message/stack trace.  This is version 0.16.0:

localhost: Exception in thread "main" java.lang.IllegalArgumentException: port out of range:-1
localhost:  at java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost:  at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:125)
localhost:  at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost:  at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:492)
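
For what it's worth, the -1 looks like what you get when a URI has no explicit
port component; a quick standalone illustration (the bucket name and host are
made up):

    import java.net.URI;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            // A bucket-style s3:// URI carries no port, so getPort() returns -1,
            // which is exactly the value InetSocketAddress rejects.
            System.out.println(new URI("s3://my-bucket").getPort());        // -1
            System.out.println(new URI("hdfs://namenode:9000").getPort());  // 9000
        }
    }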

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: Amazon S3 questions

2008-03-02 Thread Steve Sapovits

 do you have an underscore in your bucket name?
 
 Yes I do.

Sorry, I was wrong -- no underscores.  Some do, but this particular one uses
only dashes.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Amazon S3 questions

2008-03-01 Thread Steve Sapovits


I have some confusion over the use of Amazon S3 as storage.

I was looking at fs.default.name as the name node -- a host and a port the
client uses to ask the name node to perform DFS services.  But for Amazon S3
you give it an S3 bucket URL, which is really just a direct pointer to the
storage.  So it seems fs.default.name is really just a storage setting that
happens to be a service (host/port) for HDFS.  Even though S3 is also a
service, it can't also be a name node.

If that's true, where is the host/port of the name node configured separately
from fs.default.name?  It would seem that no matter what the underlying
storage is, you'd need a name node service somewhere.
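
To make that concrete, here's a rough sketch of the two cases as I understand
them (the host, bucket name, and port are made up, and the S3 case would also
need fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey set):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class DefaultFs {
        public static void main(String[] args) throws Exception {
            // HDFS case: fs.default.name is a host/port naming a name node service.
            Configuration hdfsConf = new Configuration();
            hdfsConf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem hdfs = FileSystem.get(hdfsConf);

            // S3 case: fs.default.name names a bucket; there's no name node
            // behind it, just the S3 storage itself.
            Configuration s3Conf = new Configuration();
            s3Conf.set("fs.default.name", "s3://my-bucket");
            FileSystem s3 = FileSystem.get(s3Conf);

            System.out.println(hdfs.getClass() + " vs " + s3.getClass());
        }
    }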

I'm probably missing something bigger here ...

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: Amazon S3 questions

2008-03-01 Thread Steve Sapovits

Bradford Stephens wrote:


What sort of performance hit is there for using S3 vs.  a local cluster?


It probably only makes sense speed-wise if you're running on EC2.  S3 access
from EC2 is a lot faster than accessing it from outside the Amazon cloud.

If you run on EC2, S3 is essentially your persistent storage.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]


Re: long write operations and data recovery

2008-02-29 Thread Steve Sapovits

Ted Dunning wrote:

In our case, we looked at the problem and decided that Hadoop wasn't
feasible for our real-time needs in any case.  There were several issues:

- first of all, map-reduce itself didn't seem very plausible for
real-time applications.  That left hbase and hdfs as the capabilities
offered by hadoop (for real-time stuff)


We'll be using map-reduce batch mode, so we're okay there.


The upshot is that we use hadoop extensively for batch operations
where it really shines.  The other nice effect is that we don't have
to worry all that much about HA (at least not real-time HA) since we
don't do real-time with hadoop.


What I'm struggling with is the write side of things.  We'll have a huge
amount of data to write that's essentially a log format.  It would seem that
writing it outside of HDFS and then trying to batch import it would be a
losing battle -- you would need the distributed nature of HDFS to do very
large volume writes directly, and you couldn't easily take some other flat
storage model and feed it in as a secondary step without having the HDFS
side start to lag behind.

The realization is that the Name Node could go down, so we'll have to have a
backup store that might be used during temporary outages, but most of the
writes would be direct HDFS updates.


The alternative would seem to be ending up with a set of distributed files
without a unifying distributed file system (e.g., lots of Apache web logs on
many, many individual boxes) and then having to come up with some way to
funnel those back into HDFS.
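
To make the direct-write option concrete, this is roughly the pattern I have
in mind -- just a sketch, with a made-up path and record, assuming the
current (0.16-era) FileSystem API with no append, so each roll starts a brand
new file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogRoller {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One file per roll interval; readers only see the data once close()
            // succeeds, which is the crux of the crash-mid-file question.
            Path current = new Path("/logs/events-" + System.currentTimeMillis() + ".log");
            FSDataOutputStream out = fs.create(current);
            out.write("one log record\n".getBytes());
            out.close();
        }
    }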

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: Cross-data centre DFS communication?

2008-02-28 Thread Steve Sapovits

Owen O'Malley wrote:

To copy between clusters, there is a tool called distcp. Look at 
bin/hadoop distcp. It runs a map/reduce job that copies a group of 
files. It can also be used to copy between versions of hadoop, if the 
source file system is hftp, which uses xml to read hdfs.


Can you further explain the hftp part of this?  I'm not familiar with that.
We have a similar need to go cross-data center.  In an earlier post it was
suggested that there was no map/reduce model for that, so this sounds more
like what we're looking for.


--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]


Re: Cross-data centre DFS communication?

2008-02-28 Thread Steve Sapovits

Owen O'Malley wrote:

Sure, the info server on the name node of HDFS has a read-only interface 
that lists directories in xml and allows the client to read files over 
http. There is a FileSystem implementation that provides the client side 
interface to the xml/http access.


To use it, you need a path with hftp as the protocol:
hadoop distcp hftp://namenode1:50070/foo/bar hdfs://namenode2:8020/foo


Very useful.  Thanks.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: long write operations and data recovery

2008-02-28 Thread Steve Sapovits

dhruba Borthakur wrote:


The Namenode maintains a lease for every open file that is being written
to. If the client that was writing to the file disappears, the Namenode
will do lease recovery after expiry of the lease timeout (1 hour). The
lease recovery process (in most cases) will remove the last block from
the file (it was not fully written because the client crashed before it
could fill up the block) and close the file.


How does replication affect this?  If there's at least one replicated client
still running, I assume that takes care of it?

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]


Re: long write operations and data recovery

2008-02-28 Thread Steve Sapovits



How does replication affect this?  If there's at least one replicated
 client still running, I assume that takes care of it?


Never mind -- I get this now after reading the docs again.

My remaining point of failure question concerns name nodes.  The docs say
manual intervention is still required if a name node goes down.  How is this
typically managed in production environments?  It would seem even a short
name node outage in a data intensive environment would lead to data loss
(no name node to give the data to).

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Local testing and DHCP

2008-02-27 Thread Steve Sapovits


When running in Pseudo Distributed mode as outlined in the Quickstart, I see
that the DFS is, at some level, identified by the IP address it was created
under.  I'm doing this on a laptop, and when I take it to another network the
daemons come up okay but they can't find the DFS.  It looks like that's
because the IP is different from when the DFS was first created.  Is there a
way around this so I can run on the same box and see the same DFS regardless
of what its IP is?

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]


Re: Local testing and DHCP

2008-02-27 Thread Steve Sapovits

Joydeep Sen Sarma wrote:


a few of our nodes had (for inexplicable reasons) bound to
localhost.localdomain for a while.  definitely for map-reduce - this causes
problems (not sure about hdfs).  jobs were failing saying they could not find
'localhost.localdomain' (i think this was in the reduce copy phase trying to
contact map outputs).  i am not terribly sure of the details - but there are
issues with this ..


I have a situation now, like I've seen before, where my config is exactly like
it was yesterday, but something about my network set-up is different and I
can't get the pseudo-distributed copy to come up at all on the box.  It looks
like the name node is out there, but the URL tries to go to the dfshealth.jsp
page and that fails with a 404 error.

Very frustrating, as I'm spending hours trying to get a local test setup to
install the same way it did the day before.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



long write operations and data recovery

2008-02-25 Thread Steve Sapovits


If I have a write operation that takes a while between opening and closing
the file, what is the effect of the node doing that writing crashing in the
middle?  For example, suppose I have large logs that I write to continually,
rolling them every N minutes (say every hour for the sake of discussion).  If
I have the file opened and am 90% done with my writes and things crash, what
happens to the data I've written ... realizing that at some level, data isn't
visible to the rest of the cluster until the file is closed.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: map/reduce across a WAN

2008-02-25 Thread Steve Sapovits

Ted Dunning wrote:


I think the best way to accomplish this sort of goal is to go ahead and run
independent clusters and somehow add the ability to propagate files between
clusters.  Then the cross-cluster map-reduce can run in the cluster that has
originals or replicas of all of the necessary files.


The problem is the amount of data.  We're using HDFS because the volume will
be huge.  On the surface, replicating files across data centers would appear to
take something that's sized to need map/reduce and force it down a serial sort
of pipe, over a slow connection.   Then we'd have to fan it back out to HDFS at
each replicated site in order to crunch the data.

That's why I was thinking that if map/reduce pushes most of the work off to
the individual nodes where the data resides, then map/reducing across the
entire set of boxes in all data centers might make more sense -- that would
seem to reduce the volume of data flowing between data centers.

Assume that the connections between data centers will be relatively slow, at
least compared to local gigabit LAN speeds within a data center.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Re: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread Steve Sapovits

Raghu Angadi wrote:

Please report such problems if you think it was because of HDFS, as 
opposed to some hardware or disk failures.


Will do.  I suspect it's something else.  I'm testing on a notebook in
pseudo-distributed mode (per the quick start guide).  My IP changes when I
take that box between home and work, so that could be it -- even though I'm
running everything on localhost, I've seen other issues if my hostname can't
be properly resolved.  Also, with everything in /tmp by default, shutdowns of
that box may be removing files.
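
One thing I'm going to double-check (just a sketch, assuming the standard
hadoop.tmp.dir / dfs.name.dir / dfs.data.dir properties) is where the name
node and data node state actually lives, since the defaults land under /tmp:

    import org.apache.hadoop.conf.Configuration;

    public class WhereIsMyDfs {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The defaults put everything under hadoop.tmp.dir (i.e. /tmp),
            // which the OS is free to clean out across reboots.
            System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
            System.out.println("dfs.name.dir   = " + conf.get("dfs.name.dir"));
            System.out.println("dfs.data.dir   = " + conf.get("dfs.data.dir"));
        }
    }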


--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]



Python access to HDFS

2008-02-21 Thread Steve Sapovits


Are there any existing HDFS access packages out there for Python?

I've had some success using SWIG and the C HDFS code, as documented
here:

http://www.stat.purdue.edu/~sguha/code.html

(halfway down the page) but it's slow adding support for some of the more
complex functions.  If there's anything out there I missed, I'd like to hear
about it.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]