Re: Storing millions of small files

2012-05-23 Thread Ted Dunning
Mongo has the best out-of-the-box experience of anything, but can be limited in terms of how far it will scale. HBase is a bit tricky to manage if you don't have expertise in managing Hadoop. Neither is a great idea if your data objects can be as large as 10MB. On Wed, May 23, 2012 at 8:30 AM, Brend

Re: Hadoop HA

2012-05-22 Thread Ted Dunning
No. 2.0.0 will not have the same level of HA as MapR. Specifically, the job tracker hasn't been addressed and the name node issues have only been partially addressed. On May 22, 2012, at 8:08 AM, Martinus Martinus wrote: > Hi Todd, > > Thanks for your answer. Is that will have the same capa

Re: encryption

2012-01-20 Thread Ted Dunning
Or just people who find your disks at the second-hand shop. http://www.wavy.com/dpp/news/military/tricare-beneficiaries'-data-stolen On Fri, Jan 20, 2012 at 3:36 PM, Tim Broberg wrote: > I guess the first question is the threat model: What kind of bad guy are > you trying to keep out? Is Ukrai

Re: Hadoop HDFS Backup/Restore Solutions

2012-01-03 Thread Ted Dunning
MapR provides this out of the box in a completely Hadoop compatible environment. Doing this with straight Hadoop involves a fair bit of baling wire. On Tue, Jan 3, 2012 at 1:10 PM, alo alt wrote: > Hi Mac, > > hdfs has at the moment no solution for an complete backup- and restore > process like

Re: hdfs-nfs - through chokepoint or balanced?

2011-12-16 Thread Ted Dunning
Joey is speaking precisely, but in an intentionally very limited way. Apache HDFS, the file system that comes with Apache Hadoop, does not support NFS. On the other hand, maprfs, which is part of the commercial MapR distribution based on Apache Hadoop, does support NFS natively and withou

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
It is a bit off topic, but maprfs is closely equivalent to HDFS except that it provides the read-write and NFS semantics you are looking for. Trying to shoe-horn HDFS into a job that it wasn't intended to do (i.e. general file I/O) isn't a great idea. Better to use what it is good for. On Mon, N

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
sizes are going bigger than MBs then it is > not good to use Hbase for storage. > > Any Comments > > *From:* Ted Dunning [mailto:tdunn...@maprtech.com] > *Sent:* Tuesday, November 22, 2011 11:43 AM > *To:* hdfs-user@hadoop.apache.org

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
How big is that? On Mon, Nov 21, 2011 at 9:26 PM, Stuti Awasthi wrote: > Hi Ted, > > Well in my case document size can be big, which is not good to keep in > Hbase. So I rule out this option. > > Thanks > > *From:* Ted D

Re: Version control of files present in HDFS

2011-11-21 Thread Ted Dunning
HDFS is a filesystem that is designed to support map-reduce computation. As such, the semantics differ from what SVN or GIT would want to have. HBase provides versioned values. That might suffice for your needs. On Mon, Nov 21, 2011 at 9:58 AM, Stuti Awasthi wrote: > Do we have any support fr
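A minimal sketch of how HBase's cell versioning could cover simple document versioning, assuming a table named "docs" with a column family "content" created with multiple versions enabled (the table, row, and column names here are hypothetical, and this uses the old-style HBase Java client API of that era):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumes a pre-existing table: create 'docs', {NAME => 'content', VERSIONS => 5}
    HTable table = new HTable(conf, "docs");

    // Writing the same row/column again creates a new timestamped version
    // rather than overwriting the old value.
    Put put = new Put(Bytes.toBytes("reports/q3.txt"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), Bytes.toBytes("revision 2"));
    table.put(put);

    // Read back up to three versions of the cell, newest first.
    Get get = new Get(Bytes.toBytes("reports/q3.txt"));
    get.setMaxVersions(3);
    Result result = table.get(get);
    for (KeyValue kv : result.raw()) {
      System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
    }
    table.close();
  }
}
```

This only makes sense for values well under HBase's comfortable cell size (a few MB), which is the caveat raised elsewhere in this thread.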

Re: Sizing help

2011-11-11 Thread Ted Dunning
ta, not 12GB. So > about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is > still unacceptable, so use 3x replication! :-) > --Matt > > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning wrote: > >> 3x replication has two effects. One is reliability. Thi

Re: Sizing help

2011-11-08 Thread Ted Dunning
e analysis for this usage, however. On Tue, Nov 8, 2011 at 7:32 AM, Rita wrote: > Thats a good point. What is hdfs is used as an archive? We dont really use > it for mapreduce more for archival purposes. > > > On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning wrote: > >> 3x re

Re: dfs.write.packet.size set to 2G

2011-11-08 Thread Ted Dunning
By snapshots, I mean that you can freeze a copy of a portion of the file system for later use as a backup or reference. By mirror, I mean that a snapshot can be transported to another location in the same cluster or to another cluster and the mirrored image will be updated atomically to the ne

Re: Sizing help

2011-11-07 Thread Ted Dunning
x replication on a 500tb cluster. No issues > whatsoever. 3x is for super paranoid. > > > On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning wrote: > >> Depending on which distribution and what your data center power limits >> are you may save a lot of money by going with machi

Re: Sizing help

2011-11-07 Thread Ted Dunning
Depending on which distribution and what your data center power limits are, you may save a lot of money by going with machines that have 12 x 2 or 3 TB drives. With suitable engineering margins and 3x replication you can have 5 TB net data per node and 20 nodes per rack. If you want to go all cow
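One way to arrive at that 5 TB per node figure, as a back-of-the-envelope sketch (the ~40% usable-after-headroom factor is an assumption for illustration, not something stated in the original mail):

```java
public class NodeSizing {
  public static void main(String[] args) {
    double drives = 12;                          // drives per node
    double driveTb = 3.0;                        // TB per drive
    double rawTb = drives * driveTb;             // 36 TB raw per node
    double replication = 3.0;                    // 3x replication
    double logicalTb = rawTb / replication;      // 12 TB of unique data if disks were full
    double headroom = 0.4;                       // assumed margin for temp space, imbalance, growth
    double netTb = logicalTb * headroom;         // ~4.8 TB net data per node
    System.out.printf("net data per node: %.1f TB%n", netTb);
    System.out.printf("per 20-node rack:  %.0f TB%n", netTb * 20);
  }
}
```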

Re: A question about RPC

2011-09-21 Thread Ted Dunning
IDLs are nice, but old-school systems like CORBA are death when you need to change things. Avro, protobufs and thrift are all miles better. On Wed, Sep 21, 2011 at 1:59 PM, Koert Kuipers wrote: > i would love an IDL, plus that modern serialization frameworks such as > protobuf/thrift support v
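A small sketch of the kind of schema evolution that makes Avro (and similarly protobuf/thrift) friendlier than a rigid CORBA-style IDL: data written with an old schema can still be read with a newer schema that adds a defaulted field. The record and field names below are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolution {
  public static void main(String[] args) throws Exception {
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}]}");
    // The reader adds a field with a default, so old data still resolves.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}," +
        "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // Serialize a record with the old (writer) schema.
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("name", "ted");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Deserialize with the new (reader) schema; the missing field gets its default.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<GenericRecord>(writer, reader);
    GenericRecord decoded =
        datumReader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(decoded);   // {"name": "ted", "email": "unknown"}
  }
}
```

A CORBA IDL change, by contrast, typically forces both ends to be rebuilt and redeployed in lockstep.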

Re: Regarding design of HDFS

2011-09-13 Thread Ted Dunning
2011/9/13 kang hua > Hi Master: > can you explain more detail --- "The only way to avoid this is to > make the data much more cacheable and to have a viable cache coherency > strategy. Cache coherency at the meta-data level is difficult. Cache > coherency at the block level is also diffi

Re: Regarding design of HDFS

2011-09-05 Thread Ted Dunning
The namenode is already a serious bottleneck for meta-data updates. If you allow some of the block map or meta-data to page out to disk, then the bottleneck is going to get much worse. The only way to avoid this is to make the data much more cacheable and to have a viable cache coherency strategy

Re: set reduced block size for a specific file

2011-08-27 Thread Ted Dunning
There is no way to do this for standard Apache Hadoop. But other, otherwise Hadoop compatible, systems such as MapR do support this operation. Rather than push commercial systems on this mailing list, I would simply recommend anybody who is curious to email me. On Sat, Aug 27, 2011 at 12:07 PM,

Re: HDFS File being written

2011-08-17 Thread Ted Dunning
Amen. Without a solid hand-off, your system is going to be subject to all kinds of failure modes. On Wed, Aug 17, 2011 at 11:17 AM, David Rosenstrauch wrote: > You really need to employ *some* method to reliably determine when a file > is successfully uploaded, or you're going to wind up with a ver
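One common way to get that solid hand-off on HDFS is the write-then-rename convention: write under a temporary name, and only rename into the watched directory after the stream closes cleanly, so consumers never see a half-written file. This is a sketch of that pattern, not something prescribed in the original mails, and the paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicHandoff {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path tmp = new Path("/incoming/_tmp/data-20110817.log");
    Path done = new Path("/incoming/data-20110817.log");

    // Write the whole file under a temporary name that consumers ignore.
    FSDataOutputStream out = fs.create(tmp);
    try {
      out.write("payload...".getBytes("UTF-8"));
    } finally {
      out.close();   // only after a clean close is the content complete
    }

    // The rename is the hand-off: the file appears under its final name all at once.
    if (!fs.rename(tmp, done)) {
      throw new RuntimeException("hand-off failed for " + done);
    }
  }
}
```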

Re: Running a server on HDFS

2011-07-12 Thread Ted Dunning
HDFS is not a normal file system. Instead, it is highly optimized for running map-reduce. As such, it uses replicated storage but imposes a write-once model on files. This probably makes it unsuitable as primary storage for VMs. What you need is either a conventional networked storage device or if

Re: How to create a lot files in HDFS quickly?

2011-05-29 Thread Ted Dunning
First, it is virtually impossible to create 100 million files in HDFS because the name node can't hold that many. Secondly, file creation is bottlenecked by the name node, so files can't be created at more than about 1000 per second (and achieving more than half that rate i
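Rough arithmetic behind both points, as a hedged estimate (the ~150 bytes per namespace object is a commonly quoted rule of thumb, not a figure from the original mail):

```java
public class TinyFilesEstimate {
  public static void main(String[] args) {
    long files = 100000000L;               // 100 million files
    double createsPerSec = 1000.0;         // rough name node ceiling for file creates
    double hours = files / createsPerSec / 3600.0;
    System.out.printf("time just to create them: %.0f hours%n", hours);   // ~28 hours

    // Name node memory: each file and each block is an in-memory object,
    // commonly estimated at roughly 150 bytes apiece.
    long objects = files * 2;              // assume one block per (small) file
    double heapGb = objects * 150.0 / (1024.0 * 1024.0 * 1024.0);
    System.out.printf("name node heap needed: ~%.0f GB%n", heapGb);       // ~28 GB
  }
}
```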

Re: silent data loss during append

2011-04-14 Thread Ted Dunning
What version are you using? On Thu, Apr 14, 2011 at 3:55 PM, Thanh Do wrote: > Hi all, > > I have recently seen silent data loss in our system. > Here is the case: > > 1. client appends to some block > 2. for some reason, commitBlockSynchronization > returns successfully with synclist = [] (

Re: keeping an active hdfs cluster balanced

2011-03-17 Thread Ted Dunning
How large a cluster? How large is each data-node? How much disk is devoted to hbase? How does your HDFS data arrive? From one or a few machines in the cluster? From outside the cluster? On Thu, Mar 17, 2011 at 12:13 PM, Stuart Smith wrote: > Parts of this may end up on the hbase list, but I

Re: Will blocks of an unclosed file get lost when HDFS client (or the HDFS cluster) crashes?

2011-03-13 Thread Ted Dunning
What do you mean by block? An HDFS chunk? Or a flushed write? The answer depends a bit on which version of HDFS / Hadoop you are using. With the append branches, things happen a lot more like what you expect. Without that version, it is difficult to say what will happen. Also, there are very

Re: hbase and hdfs

2011-03-08 Thread Ted Dunning
Take a look at http://opentsdb.net/ and see if it attacks your time series problem in an interesting way for what you are doing. Regarding your second comment, ZooKeeper actually makes it easier to install HBase because it stabilizes the interactions between different components. There is also an