Thanks for your suggestion, but unfortunately it did not fix the issue.


Thanks.

--sean

Sean Shanny
ssha...@tripadvisor.com




On Dec 25, 2008, at 8:19 AM, Devaraj Das wrote:

IIRC, enabling symlink creation for your files should solve the problem.
Call DistributedCache.createSymlink(conf) before submitting your job.
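Something along these lines (a rough sketch only; the "#data" and "#index" fragment names are just examples) should do it:

    // Rough sketch, untested: register the two files with fragment names and
    // have the framework symlink them into the task's working directory.
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheFile(new URI("/2008-12-19/url/data#data"), conf);
    DistributedCache.addCacheFile(new URI("/2008-12-19/url/index#index"), conf);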



On 12/25/08 10:40 AM, "Sean Shanny" <ssha...@tripadvisor.com> wrote:

To all,

Version:  hadoop-0.17.2.1-core.jar

I created a MapFile on a local node.
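(For context, a MapFile is just a directory holding a sorted "data" file plus a small "index" file; writing one locally looks roughly like the sketch below. The Text key/value types are only placeholders, not necessarily what was actually used.)

    // Sketch only: /tmp/ur ends up as a directory containing "data" and "index".
    // Keys must be appended in sorted order; Text types here are placeholders.
    Configuration conf = new Configuration();
    FileSystem localFs = FileSystem.getLocal(conf);
    MapFile.Writer writer = new MapFile.Writer(conf, localFs, "/tmp/ur", Text.class, Text.class);
    writer.append(new Text("http://example.com/"), new Text("1"));
    writer.close();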

I put the files into HDFS using the following commands:

$ bin/hadoop fs -copyFromLocal /tmp/ur/data    /2008-12-19/url/data
$ bin/hadoop fs -copyFromLocal /tmp/ur/index  /2008-12-19/url/index

and placed them in the DistributedCache with the following calls, passing in the JobConf:

DistributedCache.addCacheFile(new URI("/2008-12-19/url/data"), conf);
DistributedCache.addCacheFile(new URI("/2008-12-19/url/index"), conf);

What I cannot figure out is how to actually access the MapFile from within my Map code. I tried the following, but I get file-not-found errors when I run the job.

private FileSystem     fs;
private MapFile.Reader myReader;
private Path[]         localFiles;

....

public void configure(JobConf conf)
{
    String[] s = conf.getStrings("map.input.file");
    m_sFileName = s[0];

    try
    {
        localFiles = DistributedCache.getLocalCacheFiles(conf);

        for (Path localFile : localFiles)
        {
            String sFileName = localFile.getName();

            if (sFileName.equalsIgnoreCase("data"))
            {
                System.out.println("Full Path: " + localFile.toString());
                System.out.println("Parent: " + localFile.getParent().toString());

                fs = FileSystem.get(localFile.toUri(), conf);
                myReader = new MapFile.Reader(fs, localFile.getParent().toString(), conf);
            }
        }
    }
    catch (IOException e)
    {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
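A variant of the last two lines that goes through the local filesystem explicitly (just a sketch, not something I have verified) would be the following, since getLocalCacheFiles() hands back paths on the task tracker's local disk while the stack trace below shows the lookup going through org.apache.hadoop.dfs.DistributedFileSystem, i.e. HDFS:

    // Sketch, not verified: open the cached MapFile via the local filesystem,
    // since getLocalCacheFiles() returns paths on the task tracker's disk.
    FileSystem localFs = FileSystem.getLocal(conf);
    myReader = new MapFile.Reader(localFs, localFile.getParent().toString(), conf);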

The following exception is thrown, and I cannot figure out why the extra "data" element is being added to the end of the path. The data is actually at:

Task Logs: 'task_200812250002_0001_m_000000_0'

stdout logs
Full Path: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
Parent: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data

stderr logs
java.io.FileNotFoundException: File does not exist: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
    at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:628)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
    at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:301)
    at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:283)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:272)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:259)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:252)
    at com.TripResearch.warehouse.etl.EtlTestUrlMapLookup.configure(EtlTestUrlMapLookup.java:84)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)

The files do exist, but I don't understand why they were placed in their own directories. I would have expected both files to be at /2008-12-19/url/, not at /2008-12-19/url/data/ and /2008-12-19/url/index/.

ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
total 740640
drwxr-xr-x 2 root root 4096 Dec 24 23:49 .
drwxr-xr-x 4 root root 4096 Dec 24 23:49 ..
-rwxr-xr-x 1 root root 751776245 Dec 24 23:49 data
-rw-r--r-- 1 root root 5873260 Dec 24 23:49 .data.crc

[r...@hdp01n warehouse]# ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/index
total 2148
drwxr-xr-x 2 root root    4096 Dec 25 00:04 .
drwxr-xr-x 4 root root    4096 Dec 25 00:04 ..
-rwxr-xr-x 1 root root 2165220 Dec 25 00:04 index
-rw-r--r-- 1 root root   16924 Dec 25 00:04 .index.crc

....

I know I must be doing something really stupid here, as I am sure this has been done by lots of folks prior to my feeble attempt. I did a Google search but could not come up with any examples of using a MapFile from the DistributedCache.

Thanks.

--sean






