Re: Can we modify files in HDFS?

2010-06-29 Thread Steve Loughran

elton sky wrote:

thanx Jeff,

So...it is a significant drawback.
As a matter of fact, there are many cases where we need to modify files.



When people say Hadoop filesystems are not POSIX, this is what they
mean: no locks, no random read/write, and seeking is discouraged. Even
append is something that is only just stabilising. To be fair though,
even NFS is quirky, and that's been around since Ether-net was
considered so cutting edge it had a hyphen in the name.
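
In API terms, that leaves you with roughly the surface in the sketch
below. This is a minimal sketch only: the path is made up, append needs
dfs.support.append switched on, and there is simply no call that
rewrites bytes in the middle of a file.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsAccessSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path p = new Path("/data/events.log");        // made-up path

      // create: write once, front to back, then close
      FSDataOutputStream out = fs.create(p);
      out.writeBytes("first record\n");
      out.close();

      // append: add to the end only (needs dfs.support.append enabled)
      FSDataOutputStream more = fs.append(p);
      more.writeBytes("another record\n");
      more.close();

      // open: sequential streaming read; seek() exists but fights the design
      FSDataInputStream in = fs.open(p);
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) { /* stream through the file */ }
      in.close();

      // what does not exist: anything like an open-for-update-at-offset call
    }
  }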


HDFS delivers availability through redundant copies across multiple
machines: you can read your data on or near any machine that has a copy
of the data. Think about what you'd need to support full seek and
read/write semantics:



* seek would kill bulk IO perf on classic rotating-disk HDDs, and nobody 
can afford to build a petabyte filestore out of SSDs yet. You should be 
streaming, not seeking.


* to do writes, you'd need to lock out access to the files, which
implies either a distributed lock infrastructure (ZooKeeper?) or some
way of dealing with conflicting writes; see the lock sketch after this
list.


* if you want immediately consistent writes, you'd need to push the
changes out to all the (existing) replicas, and deal with queueing up
pending changes for machines that are currently offline in a way that I
don't want to think about.


* if you want slower-update writes (eventual consistency), then things
may be slightly simpler: you'd still need a lock on writing, but each
write could be pushed out to the readers with a bit more bandwidth and
CPU-scheduling flexibility. There's still that offline-node problem,
though. If a node that was down comes back up, how does it know its
data is out of date, and where does it get the fresh data from? And
what will it do if all the other nodes with updated data are offline?
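
To make the locking bullet concrete, the rough shape is an exclusive
per-file write lock held in something like ZooKeeper while you touch
the replicas. The sketch below only shows the idea (a real lock recipe
needs sequential znodes and watches to be fair and robust); the
ensemble address and lock path are invented.

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class WriteLockSketch {
    public static void main(String[] args) throws Exception {
      // made-up ensemble address, session timeout and lock path
      ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
      String lockPath = "/hdfs-write-locks/data_events.log";
      try {
        // an ephemeral znode acts as the lock: it vanishes if the writer dies
        zk.create(lockPath, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // ... lock held: push the in-place update out to every replica ...
      } catch (KeeperException.NodeExistsException busy) {
        // someone else holds the lock: back off and retry, or give up
      } finally {
        zk.close();   // closing the session releases the ephemeral lock node
      }
    }
  }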



 I don't understand why Yahoo didn't provide that functionality. And as
far as I know, no one else is working on this. Why is that?

It's because it scares us and we are happier writing code to live in a 
world where you don't seek and patch files, but instead add new data and 
delete old stuff. I don't know what the Cassandra and HBase teams do here.
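
Day to day that pattern looks something like the sketch below:
corrections arrive as new records in new files, and old data goes away
a whole partition at a time. The directory layout and paths here are
invented.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AppendOnlySketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // a "fix" is just a new record in a new file, not an edit in place
      FSDataOutputStream out =
          fs.create(new Path("/logs/dt=2010-06-29/part-00000"));
      out.writeBytes("user=42 status=corrected\n");
      out.close();

      // old data is retired by deleting whole date partitions
      fs.delete(new Path("/logs/dt=2010-03-01"), true);
    }
  }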


-steve






Can we modify files in HDFS?

2010-06-28 Thread elton sky
hello everyone,

After some research I found that HDFS only supports creating new files and
appending to existing files. What if I want to modify some part of a, say,
2 petabyte file? Do I have to remove it and create it again, or is there
some alternative way?


Re: Can we modify files in HDFS?

2010-06-28 Thread elton sky
thanx Jeff,

So...it is a significant drawback.
As a matter of fact, there are many cases where we need to modify files.
I don't understand why Yahoo didn't provide that functionality. And as far
as I know, no one else is working on this. Why is that?


Re: Can we modify files in HDFS?

2010-06-28 Thread Todd Lipcon
Hi Elton,

Typically, large data sets are of the sort that continuously grow, and are
not edited or amended. For example, a common Hadoop use case is the analysis
of log data or other instrumentation from web or application servers. In
these cases, files are simply added, but there is no need to go back and
change entries.

For more table-like, random-access storage on top of Hadoop, I would
encourage you to look into HBase. It supports random read/write access
with low latency.
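
For a flavour of what that looks like, here is a minimal sketch: the
table, column family and row key are made up, and the client API shown
is a newer one than today's HTable, so treat it as illustrative only.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseRandomAccessSketch {
    public static void main(String[] args) throws Exception {
      Connection conn =
          ConnectionFactory.createConnection(HBaseConfiguration.create());
      Table table = conn.getTable(TableName.valueOf("events"));

      // random write: overwrite a single cell of a single row in place
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"),
                    Bytes.toBytes("corrected"));
      table.put(put);

      // random read: fetch the row straight back with low latency
      Result row = table.get(new Get(Bytes.toBytes("user-42")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));

      table.close();
      conn.close();
    }
  }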

-Todd

On Mon, Jun 28, 2010 at 9:48 PM, elton sky eltonsky9...@gmail.com wrote:

 thanx Jeff,

 So...it is a significant drawback.
 As a matter of fact, there are many cases where we need to modify files.
 I don't understand why Yahoo didn't provide that functionality. And as far
 as I know, no one else is working on this. Why is that?




-- 
Todd Lipcon
Software Engineer, Cloudera