On Jul 5, 2010, at 5:01 PM, elton sky wrote:

> Well, this sounds good when you have many small files: you concat() them
> into a big one. I am talking about splitting a big file into blocks and
> copying its blocks in parallel.
Basically, your point is that hadoop dfs -cp is relatively slow and could be made faster: if HDFS copied a file's blocks with multiple threads, cp operations would speed up. That sounds like a high implementation cost for an operation that is rarely used. [This is more interesting in a distcp context, but even then the gain isn't that great. In my experience distcp is usually used to push a bunch of files, so you already get your parallelism at the file level, and those part files are typically about the same size.]
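For what it's worth, here is a minimal sketch of that file-level parallelism using the public FileSystem API and a plain thread pool. The class name ParallelCopy, the thread count, and the argument handling are all made up for illustration; it is a sketch of the idea, not how distcp is implemented (distcp uses map tasks rather than threads):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch: copy each file under srcDir to dstDir on its
    // own thread, i.e. parallelism at the file level, not the block level.
    public class ParallelCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path(args[0]);
        Path dstDir = new Path(args[1]);

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (FileStatus stat : fs.listStatus(srcDir)) {
          final Path src = stat.getPath();
          pool.submit(() -> {
            try {
              // FileUtil.copy reads each file serially; the speedup here
              // comes from keeping several files in flight at once.
              FileUtil.copy(fs, src, fs, new Path(dstDir, src.getName()),
                            false /* deleteSource */, conf);
            } catch (Exception e) {
              e.printStackTrace();
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }

Each thread still copies its file block by block; this just overlaps many such copies, which is roughly the win distcp already gives you when the input is a directory of similarly sized part files.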