Tsz Wo (Nicholas), Sze <s29752-hadoopu...@...> writes:
> Hi Derek,
>
> The "http" in "http://core:7274/logs/log.20090121" should be "hftp". hftp is
> the scheme name of HftpFileSystem, which uses http for accessing hdfs.
>
> Hope this helps.
>
> Nicholas Sze
I thought hftp was used to talk to servlets that act as a gateway to hdfs,
right? In my case these will be servers serving up static log files, running
no servlets. I believe this is the scenario that HADOOP-341 describes:
"Enhance it [distcp] to handle http as the source protocol i.e. support
copying files from arbitrary http-based sources into the dfs."

In any case, if I just use hftp instead of http I get this error:

bin/hadoop distcp -f hftp://core:7274/logs/log.20090121 /user/dyoung/mylogs

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Server returned HTTP response code: 400 for URL: http://core:7274/data/logs/log.20090121?ugi=dyoung,dyoung,adm,dialout,fax,cdrom,cdrom,floppy,floppy,tape,audio,audio,dip,dip,video,video,plugdev,plugdev,admin,users,scanner,fuse,fuse,lpadmin,admin,vboxusers
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
        at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
        at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:581)
        at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
        at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)

> > ----- Original Message ----
> > From: Derek Young <dyo...@...>
> > To: core-u...@...
> > Sent: Wednesday, January 21, 2009 1:23:56 PM
> > Subject: using distcp for http source files
> >
> > I plan to use hadoop to do some log processing and I'm working on a method
> > to load the files (probably nightly) into hdfs. My plan is to have a web
> > server on each machine with logs that serves up the log directories.
> > Then I would give distcp a list of http URLs of the log files and have it
> > copy the files in.
> >
> > Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like
> > this should be supported, but the http URLs are not working for me. Are
> > http source URLs still supported?
> >
> > I tried a simple test with an http source URL (using Hadoop 0.19):
> >
> > hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs
> >
> > This fails:
> >
> > With failures, global counters are inaccurate; consider running with -i
> > Copy failed: java.io.IOException: No FileSystem for scheme: http
> >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
> >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
> >         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> >         at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
> >         at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
> >         at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
> >         at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >         at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
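P.S. The 400 in the hftp attempt is consistent with hftp not being plain
http: judging from the failing URL in the stack trace, HftpFileSystem rewrites
hftp://host:port/<path> into an HTTP request against a /data servlet on the
namenode and appends the caller's ugi (user/group info) as a query parameter.
A static file server has no such servlet, so it rejects the request. A rough
sketch of that rewrite (a simplification in Python, not the actual Java code;
the /data path and ugi parameter are inferred from the error above):

```python
from urllib.parse import urlparse, urlunparse

def hftp_to_http(hftp_url: str, ugi: str) -> str:
    """Approximate the HTTP URL that HftpFileSystem builds for a read.

    The hftp path is served by a /data servlet on the target, and the
    user/group info is passed along as a ugi query parameter -- which
    is why a plain static web server answers with HTTP 400.
    """
    parts = urlparse(hftp_url)
    return urlunparse(
        ("http", parts.netloc, "/data" + parts.path, "", "ugi=" + ugi, "")
    )

print(hftp_to_http("hftp://core:7274/logs/log.20090121", "dyoung,dyoung"))
# http://core:7274/data/logs/log.20090121?ugi=dyoung,dyoung
```

So hftp only works against a running HDFS cluster's web interface, not against
arbitrary http sources, which is why HADOOP-341 asked for a separate http
source scheme in the first place.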