> Basically, your point is that hadoop dfs -cp is relatively slow and could
> be made faster. If HDFS had a more multi-threaded design, it would make cp
> operations faster.

What I mean is: if we know the size of a file, we can parallelize the copy
by computing its block boundaries. Otherwise we couldn't.
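Roughly, something like the sketch below is what I have in mind. It is only
a rough illustration against the public FileSystem API; ParallelBlockCopy
and copyRange are made-up names, not anything that exists in HDFS. Since an
HDFS file has only one writer, each thread writes its byte range into its
own part file, and the parts would still have to be stitched back into one
file afterwards (e.g. with concat()).

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelBlockCopy {

        // Split src into block-sized ranges and copy each range in its own
        // thread. dstPrefix.part0, dstPrefix.part1, ... are the outputs.
        public static void copy(final FileSystem fs, final Path src,
                                Path dstPrefix)
                throws IOException, InterruptedException {
            FileStatus stat = fs.getFileStatus(src);
            long len = stat.getLen();         // the size we need up front
            long blockSize = stat.getBlockSize();
            int nBlocks = (int) ((len + blockSize - 1) / blockSize);

            List<Thread> workers = new ArrayList<Thread>();
            for (int i = 0; i < nBlocks; i++) {
                final long offset = i * blockSize;
                final long length = Math.min(blockSize, len - offset);
                final Path part = dstPrefix.suffix(".part" + i);
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        try {
                            copyRange(fs, src, part, offset, length);
                        } catch (IOException e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
                t.start();
                workers.add(t);
            }
            for (Thread t : workers) {
                t.join();  // wait for all ranges to finish
            }
            // The part files still need to be concatenated into one file.
        }

        // Copy length bytes of src, starting at offset, into a new file dst.
        private static void copyRange(FileSystem fs, Path src, Path dst,
                                      long offset, long length)
                throws IOException {
            byte[] buf = new byte[64 * 1024];
            FSDataInputStream in = fs.open(src);
            FSDataOutputStream out = fs.create(dst);
            try {
                in.seek(offset);
                long remaining = length;
                while (remaining > 0) {
                    int n = in.read(buf, 0,
                                    (int) Math.min(buf.length, remaining));
                    if (n < 0) {
                        break;  // unexpected EOF
                    }
                    out.write(buf, 0, n);
                    remaining -= n;
                }
            } finally {
                in.close();
                out.close();
            }
        }
    }

The point is just that FileStatus.getLen() plus the block size gives you the
split boundaries for free. Without the size, you can't carve up the ranges
ahead of time, which is why I say we couldn't parallelize otherwise.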
On Tue, Jul 6, 2010 at 10:47 AM, Allen Wittenauer <awittena...@linkedin.com> wrote:

> On Jul 5, 2010, at 5:01 PM, elton sky wrote:
>
> > Well, this sounds good when you have many small files: you concat() them
> > into a big one. I am talking about splitting a big file into blocks and
> > copying several blocks in parallel.
>
> Basically, your point is that hadoop dfs -cp is relatively slow and could
> be made faster. If HDFS had a more multi-threaded design, it would make cp
> operations faster.
>
> This sounds like a particularly high cost for an operation that is rarely
> utilized. [This is much more interesting in a distcp context, but even then
> not that great. distcp in my experience is usually used to push a bunch of
> files, so you get your parallelism at the file level. Typically these are
> part files of approximately the same size.]