Version:  hadoop-

I created a MapFile on a local node.

I put the files into HDFS using the following commands:

$ bin/hadoop fs -copyFromLocal /tmp/ur/data  /2008-12-19/url/data
$ bin/hadoop fs -copyFromLocal /tmp/ur/index /2008-12-19/url/index

I then added them to the DistributedCache with the following calls when setting up the JobConf:

DistributedCache.addCacheFile(new URI("/2008-12-19/url/data"), conf);
DistributedCache.addCacheFile(new URI("/2008-12-19/url/index"), conf);

What I cannot figure out is how to actually access the MapFile from within my Map code. I tried the following, but I get file-not-found errors when I run the job.

private FileSystem     fs;
private MapFile.Reader myReader;
private Path[]         localFiles;

public void configure(JobConf conf) {
    String[] s = conf.getStrings("map.input.file");
    m_sFileName = s[0];

    try {
        localFiles = DistributedCache.getLocalCacheFiles(conf);

        for (Path localFile : localFiles) {
            String sFileName = localFile.getName();

            if (sFileName.equalsIgnoreCase("data")) {
                System.out.println("Full Path: " + localFile.toString());
                System.out.println("Parent: " + localFile.getParent().toString());

                fs = FileSystem.get(localFile.toUri(), conf);
                myReader = new MapFile.Reader(fs, localFile.getParent().toString(), conf);
            }
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
    }
}

The following exception is thrown and I cannot figure out why it is adding the extra data element at the end of the path. The data is actually at

Task Logs: 'task_200812250002_0001_m_000000_0'

stdout logs
Full Path: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
Parent: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data

stderr logs
File does not exist: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus( 369)
    at org.apache.hadoop.fs.FileSystem.getLength(
    at$Reader.<init>( 1431)
    at $Reader.<init>(
    at $Reader.createDataFileReader(
    at$
    at$Reader.<init>(
    at$Reader.<init>(
    at$Reader.<init>(
    at com.TripResearch.warehouse.etl.EtlTestUrlMapLookup.configure( 84)
    at org.apache.hadoop.util.ReflectionUtils.setConf( 58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance( 82)
    at org.apache.hadoop.mapred.MapRunner.configure(
    at org.apache.hadoop.util.ReflectionUtils.setConf( 58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance( 82)
    at
    at org.apache.hadoop.mapred.TaskTracker$Child.main(
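One thing the trace does show is that the lookup went through DistributedFileSystem.getFileStatus, so the reader was apparently checking HDFS rather than the local disk. A possible explanation, offered as an assumption rather than a confirmed diagnosis: the path returned by getLocalCacheFiles carries no URI scheme, and FileSystem.get(URI, conf) falls back to the configured default file system when the scheme is missing. The plain-Java sketch below only illustrates the scheme-less URI part; SchemeCheck, schemeOf and asLocalUri are hypothetical names, not Hadoop APIs.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeCheck {
    // Returns the URI scheme of a path-like string, or null when it has none.
    static String schemeOf(String path) throws URISyntaxException {
        return new URI(path).getScheme();
    }

    // Hypothetical helper: qualifies a scheme-less absolute path as a file: URI.
    static URI asLocalUri(String absolutePath) throws URISyntaxException {
        return new URI("file", null, absolutePath, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Path copied from the "Full Path" line in the stdout log.
        String cached = "/tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data";
        System.out.println("scheme: " + schemeOf(cached));
        System.out.println("qualified scheme: " + asLocalUri(cached).getScheme());
    }
}
```

If this reading is right, obtaining the file system with FileSystem.getLocal(conf) instead of FileSystem.get(localFile.toUri(), conf) might make the reader look on the local disk, though I have not verified that against this Hadoop version.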

The files do exist, but I don't understand why each one was placed in its own directory. I would have expected both files to be in /2008-12-19/url/, not in /2008-12-19/url/data/ and /2008-12-19/url/index/.

ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
total 740640
drwxr-xr-x 2 root root 4096 Dec 24 23:49 .
drwxr-xr-x 4 root root 4096 Dec 24 23:49 ..
-rwxr-xr-x 1 root root 751776245 Dec 24 23:49 data
-rw-r--r-- 1 root root 5873260 Dec 24 23:49 .data.crc

[r...@hdp01n warehouse]# ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/index
total 2148
drwxr-xr-x 2 root root    4096 Dec 25 00:04 .
drwxr-xr-x 4 root root    4096 Dec 25 00:04 ..
-rwxr-xr-x 1 root root 2165220 Dec 25 00:04 index
-rw-r--r-- 1 root root   16924 Dec 25 00:04 .index.crc


I know I must be doing something really stupid here, as I am sure lots of folks have done this before my feeble attempt. I did a Google search but could not find any examples of using a MapFile with the DistributedCache.



