Hi,
I am running HDFS on Amazon EC2.
Say I have an FTP server that stores some data.
I would like to copy this data directly into HDFS in a parallel way
(which may be more efficient).
I think hadoop distcp is what I need, but
$ bin/hadoop distcp ftp://username:passwd@hostname/some/path/
hdfs://namenode/some/path
doesn't work:
13/07/05 16:13:46 INFO tools.DistCp:
srcPaths=[ftp://username:passwd@hostname/some/path/]
13/07/05 16:13:46 INFO tools.DistCp: destPath=hdfs://namenode/some/path
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input
source ftp://username:passwd@hostname/some/path/ does not exist.
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:641)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
I checked the path by opening the FTP URL in Chrome, and the file
really exists; I can even download it.
Then I tried to list the files under the path with:
$ bin/hadoop dfs -ls ftp://username:passwd@hostname/some/path/
It fails with:
ls: Cannot access ftp://username:passwd@hostname/some/path/: No
such file or directory.
That looks like the same problem.
Is there any workaround?
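One thing I have not ruled out: maybe the FTP filesystem needs to be
configured explicitly in core-site.xml instead of (or in addition to)
passing credentials in the URI. If I understand FTPFileSystem
correctly, it reads properties along these lines (the property names
below are my assumption, with my actual hostname substituted in):

```xml
<!-- core-site.xml: FTP settings for FTPFileSystem (names assumed, not verified) -->
<property>
  <name>fs.ftp.host</name>
  <value>hostname</value>
</property>
<property>
  <name>fs.ftp.user.hostname</name>
  <value>username</value>
</property>
<property>
  <name>fs.ftp.password.hostname</name>
  <value>passwd</value>
</property>
```

Has anyone gotten distcp from FTP to work this way?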
Thank you in advance.
Hao.
--
Hao Ren
ClaraVista
www.claravista.fr