If the data is static, you may ship the file with your job jar and then read it locally at the beginning of each map task, in the configure() method.

Hairong
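(A minimal sketch of that suggestion, assuming the ~20 kB file is packaged inside the job jar as a classpath resource; the resource name "/lookup.bin", the class name, and the key/value types below are illustrative, not something posted in this thread. It uses the 0.18-era mapred API, where configure() runs once per task before any map() calls.)

<code>
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // the small (~20 kB) "global" byte array, loaded once per map task
    private byte[] lookup;

    public void configure(JobConf job) {
        // "/lookup.bin" is a hypothetical resource bundled in the job jar,
        // so every task can read it locally without touching HDFS.
        InputStream in = getClass().getResourceAsStream("/lookup.bin");
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            in.close();
            lookup = out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException("could not load lookup data", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // ... use 'lookup' here ...
    }
}
</code>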
On 6/25/08 9:43 AM, "lohit" <[EMAIL PROTECTED]> wrote:

> As Steve mentioned, you could open an HDFS file from within your map/reduce
> task. Also, instead of using DistributedFileSystem, you would actually use
> FileSystem. This is what I do:
>
> <code>
> FileSystem fs = FileSystem.get(new Configuration());
> FSDataInputStream file = fs.open(new Path("/user/foo/jambajuice"));
> </code>
>
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Steve Loughran <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 25, 2008 9:15:55 AM
> Subject: Re: Global Variables via DFS
>
> javaxtreme wrote:
>> Hello all,
>> I am having a bit of trouble with a seemingly simple problem. I would like
>> to have a global variable, a byte array, that all of my map tasks have
>> access to. The best way I currently know of to do this is to keep a file
>> on the DFS and load it into each map task (note: the global variable is
>> very small, ~20 kB). My problem is that I can't seem to load any file from
>> the Hadoop DFS into my program via the API. I know that the
>> DistributedFileSystem class has to come into play, but for the life of me
>> I can't get it to work.
>>
>> I noticed there is an initialize() method in the DistributedFileSystem
>> class, and I thought I would need to call that, but I'm unsure what the
>> URI parameter ought to be. I tried "localhost:50070", which stalled the
>> system and threw a connection-timeout error. I then tried simply calling
>> DistributedFileSystem.open(), but again my program failed, this time with
>> a NullPointerException. I'm assuming that stems from the fact that my DFS
>> object is not "initialized".
>>
>> Does anyone have any information on how exactly one programmatically goes
>> about loading a file from the DFS? I would greatly appreciate any help.
>>
>
> If the data changes, this sounds more like the kind of data that a
> distributed hash table or tuple space should be looking after: sharing
> facts between nodes.
>
> 1. What is the rate of change of the data?
> 2. What are your requirements for consistency?
>
> If the data is static, then yes, a shared file works. Here are my code
> fragments for working with one. You grab the URI from the configuration,
> then initialise the DFS with both the URI and the configuration.
>
> public static DistributedFileSystem createFileSystem(ManagedConfiguration conf)
>         throws SmartFrogRuntimeException {
>     String filesystemURL = conf.get(HadoopConfiguration.FS_DEFAULT_NAME);
>     URI uri = null;
>     try {
>         uri = new URI(filesystemURL);
>     } catch (URISyntaxException e) {
>         throw (SmartFrogRuntimeException) SmartFrogRuntimeException
>                 .forward(ERROR_INVALID_FILESYSTEM_URI + filesystemURL, e);
>     }
>     DistributedFileSystem dfs = new DistributedFileSystem();
>     try {
>         dfs.initialize(uri, conf);
>     } catch (IOException e) {
>         throw (SmartFrogRuntimeException) SmartFrogRuntimeException
>                 .forward(ERROR_FAILED_TO_INITIALISE_FILESYSTEM, e);
>     }
>     return dfs;
> }
>
> As for which URLs work, try "localhost:9000"; this works on machines
> where I've brought a DFS up on that port. Use netstat to verify your
> chosen port is live.
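(For the case where the data really does live on HDFS, Lohit's FileSystem.get() fragment can be fleshed out roughly as follows. The path "/user/foo/jambajuice" comes from his example; the configure() wrapper, the class name, and the readFully() call are assumptions added here for illustration, not code from the thread.)

<code>
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class HdfsLookupLoader extends MapReduceBase {

    // the small "global" byte array, read from HDFS once per task
    protected byte[] lookup;

    public void configure(JobConf job) {
        try {
            // FileSystem.get() returns whatever fs.default.name points at:
            // HDFS on a real cluster, the local filesystem in standalone mode.
            FileSystem fs = FileSystem.get(job);
            Path path = new Path("/user/foo/jambajuice");
            FileStatus status = fs.getFileStatus(path);
            lookup = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(path);
            try {
                in.readFully(0, lookup);   // positioned read of the whole file
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new RuntimeException("could not read lookup data from HDFS", e);
        }
    }
}
</code>

(Passing the task's own JobConf to FileSystem.get(), rather than a fresh new Configuration(), keeps the filesystem URI consistent with whatever the job was actually submitted with, which sidesteps the "which URL do I initialize with" problem the original poster ran into.)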