Archives not getting unarchived at tasktrackers

2009-06-27 Thread akhil1988

Hi All,

I am using DistributedCache.addCacheArchive() to distribute a tar file to
the tasktrackers using the following statement:

DistributedCache.addCacheArchive(new URI("/home/akhil1988/sample.tar"),
conf);

According to the documentation, it should get un-archived at the tasktrackers.
But the statement:

DistributedCache.getLocalCacheArchives(conf); 

returns the following Path

/hadoop/tmp/hadoop/mapred/local/taskTracker/archive/cn1.cloud.cs.illinois.edu/home/akhil1988/sample.tar

That means sample.tar did not get un-archived, and I am not able to access
the file sample.txt under the above path.

Can anyone tell where I am going wrong?

I tarred the file sample.txt using the following command: tar -cvf
sample.tar sample.txt 
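For reference, here is a minimal sketch (not taken from the job above) of how a
map task typically consumes an archive added with addCacheArchive(), assuming
the archive really is un-archived under the path returned by
getLocalCacheArchives(); the class name and output key are invented for
illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SampleTarMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Path archiveDir; // local directory the archive should be unpacked into

    @Override
    public void configure(JobConf conf) {
        try {
            // One entry per archive registered with DistributedCache.addCacheArchive().
            Path[] archives = DistributedCache.getLocalCacheArchives(conf);
            if (archives != null && archives.length > 0) {
                archiveDir = archives[0];
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not look up local cache archives", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // sample.txt is expected inside the unpacked sample.tar directory.
        BufferedReader in = new BufferedReader(
                new FileReader(new Path(archiveDir, "sample.txt").toString()));
        String firstLine = in.readLine();
        in.close();
        output.collect(new Text("first-line"), new Text(firstLine == null ? "" : firstLine));
    }
}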

Thanks,
Akhil
-- 
View this message in context: 
http://www.nabble.com/Archives-not-getting-unarchived-at-tasktrackers-tp24233281p24233281.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-26 Thread akhil1988

Thanks Chris for your reply!

Well, I could not understand much of what was discussed in that thread, since
I am not familiar with Cascading.

My problem is simple - I want a directory to be present in the local working
directory of tasks so that I can access it from my map task in the following
manner:

FileInputStream fin = new FileInputStream("Config/file1.config");

where,
Config is a directory which contains many files/directories, one of which is
file1.config

It would be helpful if you could tell me what statements to use to
distribute a directory to the tasktrackers.
The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
that archives are unzipped on the tasktrackers, but I want an example of how
to use this in the case of a directory.

Thanks,
Akhil



Chris Curtin-2 wrote:
 
 Hi,
 
 I've found it much easier to write the file to HDFS using the API, then pass
 the path to the file in HDFS as a property. You'll need to remember to
 clean up the file after you're done with it.
 
 Example details are in this thread:
 http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
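 For illustration, a rough sketch of the approach Chris describes: write the
 side data to HDFS yourself, hand its path to the job as an ordinary
 configuration property, and open it from the task. The property name, path,
 and class name below are invented for the example:

 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.JobConf;

 public class SideFileSketch {

     // Driver side: write the side data to HDFS and record its path as a job property.
     public static void publishSideFile(JobConf conf, String contents) throws IOException {
         FileSystem fs = FileSystem.get(conf);
         Path side = new Path("/user/akhil1988/side-data/config.txt"); // illustrative path
         FSDataOutputStream out = fs.create(side, true);
         out.writeBytes(contents);
         out.close();
         conf.set("example.side.file", side.toString()); // property name is arbitrary
     }

     // Task side (e.g. from Mapper.configure()): read the file back via the property.
     public static String readSideFile(JobConf conf) throws IOException {
         Path side = new Path(conf.get("example.side.file"));
         FileSystem fs = side.getFileSystem(conf);
         BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(side)));
         StringBuilder text = new StringBuilder();
         String line;
         while ((line = in.readLine()) != null) {
             text.append(line).append('\n');
         }
         in.close();
         return text.toString();
     }
 }

 As Chris notes, the driver (or a cleanup step) should delete the file, for
 example with fs.delete(side, false), once the job is finished.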
 
 Hope this helps,
 
 Chris
 
 On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 akhilan...@gmail.com wrote:
 

 Please ask any questions if I am not clear above about the problem I am
 facing.

 Thanks,
 Akhil

 akhil1988 wrote:
 
  Hi All!
 
  I want a directory to be present in the local working directory of the
  task for which I am using the following statements:
 
  DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
  conf);
  DistributedCache.createSymlink(conf);
 
  Here Config is a directory which I have zipped and put at the given
  location in HDFS
 
  I have zipped the directory because the API doc of DistributedCache
  (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
 the
  archive files are unzipped in the local cache directory :
 
  DistributedCache can be used to distribute simple, read-only data/text
  files and/or more complex types such as archives, jars etc. Archives
 (zip,
  tar and tgz/tar.gz files) are un-archived at the slave nodes.
 
  So, from my understanding of the API docs I expect that the Config.zip
  file will be unzipped to Config directory and since I have SymLinked
 them
  I can access the directory in the following manner from my map
 function:
 
  FileInputStream fin = new FileInputStream("Config/file1.config");
 
  But I get the FileNotFoundException on the execution of this statement.
  Please let me know where I am going wrong.
 
  Thanks,
  Akhil
 

 --
 View this message in context:
 http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Using addCacheArchive

2009-06-25 Thread akhil1988

Hi All!

I want a directory to be present in the local working directory of the task
for which I am using the following statements: 

DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
conf);
DistributedCache.createSymlink(conf);

 Here Config is a directory which I have zipped and put at the given
 location in HDFS

I have zipped the directory because the API doc of DistributedCache
(http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
archive files are unzipped in the local cache directory :

DistributedCache can be used to distribute simple, read-only data/text files
and/or more complex types such as archives, jars etc. Archives (zip, tar and
tgz/tar.gz files) are un-archived at the slave nodes.

So, from my understanding of the API docs, I expect that the Config.zip file
will be unzipped to a Config directory, and since I have symlinked them I can
access the directory in the following manner from my map function:

FileInputStream fin = new FileInputStream("Config/file1.config");

But I get the FileNotFoundException on the execution of this statement.
Please let me know where I am going wrong.

Thanks,
Akhil
-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24207739.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-25 Thread akhil1988

Please ask any questions if I am not clear above about the problem I am
facing.

Thanks,
Akhil

akhil1988 wrote:
 
 Hi All!
 
 I want a directory to be present in the local working directory of the
 task for which I am using the following statements: 
 
 DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
 conf);
 DistributedCache.createSymlink(conf);
 
 Here Config is a directory which I have zipped and put at the given
 location in HDFS
 
 I have zipped the directory because the API doc of DistributedCache
 (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
 archive files are unzipped in the local cache directory :
 
 DistributedCache can be used to distribute simple, read-only data/text
 files and/or more complex types such as archives, jars etc. Archives (zip,
 tar and tgz/tar.gz files) are un-archived at the slave nodes.
 
 So, from my understanding of the API docs I expect that the Config.zip
 file will be unzipped to Config directory and since I have SymLinked them
 I can access the directory in the following manner from my map function:
 
 FileInputStream fin = new FileInputStream("Config/file1.config");
 
 But I get the FileNotFoundException on the execution of this statement.
 Please let me know where I am going wrong.
 
 Thanks,
 Akhil
 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-25 Thread akhil1988

Thanks Amareshwari for your reply!

The file Config.zip is lying in HDFS; if it were not, the error would have
been reported by the jobtracker itself while executing the statement:
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
conf);

But I get the error in the map function when I try to access the Config
directory.

Now I am using the following statement, but I still get the same error:
DistributedCache.addCacheArchive(new
URI("/home/akhil1988/Config.zip#Config"), conf);

Do you think there could be any problem in distributing a zipped directory
and having Hadoop unzip it recursively?

Thanks!
Akhil



Amareshwari Sriramadasu wrote:
 
 Hi Akhil,
 
 DistributedCache.addCacheArchive takes a path on HDFS. From your code, it
 looks like you are passing a local path.
 Also, if you want to create a symlink, you should pass the URI as
 hdfs://path#linkname, besides calling
 DistributedCache.createSymlink(conf);
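
For concreteness, a minimal sketch of the pattern Amareshwari describes (the
namenode host and port, and the paths, are illustrative, not taken from this
thread):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class ArchiveSymlinkSketch {
    public static void addConfigArchive(JobConf conf) throws Exception {
        // Config.zip must already sit in HDFS; the #Config fragment names the symlink
        // that will appear in each task's working directory.
        DistributedCache.addCacheArchive(
                new URI("hdfs://namenode:9000/user/akhil1988/Config.zip#Config"), conf);
        DistributedCache.createSymlink(conf);
        // Inside the map task the unpacked archive is then reachable as:
        //   FileInputStream fin = new FileInputStream("Config/file1.config");
    }
}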
 
 Thanks
 Amareshwari
 
 
 akhil1988 wrote:
 Please ask any questions if I am not clear above about the problem I am
 facing.

 Thanks,
 Akhil

 akhil1988 wrote:
   
 Hi All!

 I want a directory to be present in the local working directory of the
 task for which I am using the following statements: 

 DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
 conf);
 DistributedCache.createSymlink(conf);

 
 Here Config is a directory which I have zipped and put at the given
 location in HDFS
 
 I have zipped the directory because the API doc of DistributedCache
 (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
 the
 archive files are unzipped in the local cache directory :

 DistributedCache can be used to distribute simple, read-only data/text
 files and/or more complex types such as archives, jars etc. Archives
 (zip,
 tar and tgz/tar.gz files) are un-archived at the slave nodes.

 So, from my understanding of the API docs I expect that the Config.zip
 file will be unzipped to Config directory and since I have SymLinked
 them
 I can access the directory in the following manner from my map function:

 FileInputStream fin = new FileInputStream("Config/file1.config");

 But I get the FileNotFoundException on the execution of this statement.
 Please let me know where I am going wrong.

 Thanks,
 Akhil

 

   
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24214657.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-25 Thread akhil1988

Yes, my HDFS paths are of the form /home/user-name/,
and I have used these with DistributedCache's addCacheFile method
successfully.

Thanks,
Akhil



Amareshwari Sriramadasu wrote:
 
 Is your HDFS path /home/akhil1988/Config.zip? Usually an HDFS path is of the
 form /user/akhil1988/Config.zip.
 Just wondering if you are giving the wrong path in the URI!
 
 Thanks
 Amareshwari
 
 akhil1988 wrote:
 Thanks Amareshwari for your reply!

 The file Config.zip is lying in the HDFS, if it would not have been then
 the
 error would be reported by the jobtracker itself while executing the
 statement:
 DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
 conf);

 But I get error in the map function when I try to access the Config
 directory. 

 Now I am using the following statement but still getting the same error: 
 DistributedCache.addCacheArchive(new
 URI("/home/akhil1988/Config.zip#Config"), conf);

 Do you think whether there should be any problem in distributing a zipped
 directory and then hadoop unzipping it recursively.

 Thanks!
 Akhil



 Amareshwari Sriramadasu wrote:
   
 Hi Akhil,

 DistributedCache.addCacheArchive takes path on hdfs. From your code, it
 looks like you are passing local path.
 Also, if you want to create symlink, you should pass URI as
 hdfs://path#linkname, besides calling  
 DistributedCache.createSymlink(conf);

 Thanks
 Amareshwari


 akhil1988 wrote:
 
 Please ask any questions if I am not clear above about the problem I am
 facing.

 Thanks,
 Akhil

 akhil1988 wrote:
   
   
 Hi All!

 I want a directory to be present in the local working directory of the
 task for which I am using the following statements: 

  DistributedCache.addCacheArchive(new
  URI("/home/akhil1988/Config.zip"),
  conf);
 DistributedCache.createSymlink(conf);

 
 
 Here Config is a directory which I have zipped and put at the given
 location in HDFS
 
 
 I have zipped the directory because the API doc of DistributedCache
 (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
 the
 archive files are unzipped in the local cache directory :

 DistributedCache can be used to distribute simple, read-only data/text
 files and/or more complex types such as archives, jars etc. Archives
 (zip,
 tar and tgz/tar.gz files) are un-archived at the slave nodes.

 So, from my understanding of the API docs I expect that the Config.zip
 file will be unzipped to Config directory and since I have SymLinked
 them
 I can access the directory in the following manner from my map
 function:

  FileInputStream fin = new FileInputStream("Config/file1.config");

 But I get the FileNotFoundException on the execution of this
 statement.
 Please let me know where I am going wrong.

 Thanks,
 Akhil

 
 
   
   

 

   
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24214730.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Strange Exeception

2009-06-23 Thread akhil1988

Thanks Jason!

I gave your suggestion to my cluster administrator and now it is working.
Following was his reply to me:

But /hadoop/tmp is not /scratch and the only thing that I clean is
/scratch. It looks like the disks in the job tracker machine died. I
swapped the disks from another node and rebuilt it. As far as I can tell
it is working.

Thanks Again!



jason hadoop wrote:
 
 The directory specified by the configuration parameter mapred.system.dir,
 defaulting to /tmp/hadoop/mapred/system, doesn't exist.
 
 Most likely your tmp cleaner task has removed it, and I am guessing it is
 only created at cluster start time.
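
 As a small aside, one way to check what the cluster thinks mapred.system.dir
 is, and whether it still exists, is a quick client-side probe like the
 following (the default value shown is the one quoted above):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class SystemDirCheck {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         // Falls back to the default mentioned in this thread.
         Path systemDir = new Path(conf.get("mapred.system.dir", "/tmp/hadoop/mapred/system"));
         FileSystem fs = systemDir.getFileSystem(conf);
         System.out.println("mapred.system.dir = " + systemDir
                 + ", exists: " + fs.exists(systemDir));
     }
 }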
 
 On Mon, Jun 22, 2009 at 6:19 PM, akhil1988 akhilan...@gmail.com wrote:
 

 Hi All!

 I have been running Hadoop jobs through my user account on a cluster, for
 a
 while now. But now I am getting this strange exception when I try to
 execute
 a job. If anyone knows, please let me know why this is happening.

 [akhil1...@altocumulus WordCount]$ hadoop jar wordcount_classes_dir.jar
 org.uiuc.upcrc.extClasses.WordCount /home/akhil1988/input
 /home/akhil1988/output
 JO
 09/06/22 19:19:01 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException:
 /hadoop/tmp/hadoop/mapred/local/jobTracker/job_200906111015_0167.xml
 (Read-only file system)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:179)
at

 org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:187)
at

 org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:183)
at
 org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:241)
at

 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.init(ChecksumFileSystem.java:327)
at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:360)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1214)
at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1195)
at
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:212)
at
 org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2230)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:828)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1127)
at org.uiuc.upcrc.extClasses.WordCount.main(WordCount.java:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 Thanks,
 Akhil

 --
 View this message in context:
 http://www.nabble.com/Strange-Exeception-tp24158395p24158395.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.


 
 
 -- 
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals
 
 

-- 
View this message in context: 
http://www.nabble.com/Strange-Exeception-tp24158395p24177363.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Strange Exeception

2009-06-22 Thread akhil1988

Hi All!

I have been running Hadoop jobs through my user account on a cluster for a
while now. But now I am getting this strange exception when I try to execute
a job. If anyone knows, please let me know why this is happening.

[akhil1...@altocumulus WordCount]$ hadoop jar wordcount_classes_dir.jar
org.uiuc.upcrc.extClasses.WordCount /home/akhil1988/input
/home/akhil1988/output
JO
09/06/22 19:19:01 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException:
/hadoop/tmp/hadoop/mapred/local/jobTracker/job_200906111015_0167.xml
(Read-only file system)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:179)
at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:187)
at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:183)
at
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:241)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.init(ChecksumFileSystem.java:327)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:360)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1214)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1195)
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:212)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2230)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:828)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1127)
at org.uiuc.upcrc.extClasses.WordCount.main(WordCount.java:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


Thanks,
Akhil

-- 
View this message in context: 
http://www.nabble.com/Strange-Exeception-tp24158395p24158395.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-18 Thread akhil1988

Hi Jason!

I finally found out that there was some problem in reserving the heap size,
which I have now resolved. Actually, we cannot change HADOOP_HEAPSIZE
using export from our user account after Hadoop has been started. It has
to be changed by the root user.

I have a user account on the cluster, and I was trying to change
HADOOP_HEAPSIZE from my user account using 'export', which had no effect.
So I had to request my cluster administrator to increase HADOOP_HEAPSIZE
in hadoop-env.sh and then restart Hadoop. Now the program is running
absolutely fine. Thanks for your help.

One thing that I would like to ask you: can we use DistributedCache
for transferring directories to the local cache of the tasks?

Thanks,
Akhil



akhil1988 wrote:
 
 Hi Jason!
 
 Thanks for going with me to solve my problem.
 
 To restate things and make it more easier to understand: I am working in
 local mode in the directory which contains the job jar and also the Config
 and Data directories.
 
 I just removed the following three statements from my code:
 DistributedCache.addCacheFile(new
 URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
 DistributedCache.addCacheFile(new
 URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
 DistributedCache.createSymlink(conf);
 
 The program executes till the same point as before now also and
 terminates. That means the above three statements are of no use while
 working in local mode. In local mode, the working directory for the
 mapreduce tasks becomes the current woking direcotry in which you started
 the hadoop command to execute the job.
 
 Since I have removed the DistributedCache.add. statements there should
 be no issue whether I am giving a file name or a directory name as
 argument to it. Now it seems to me that there is some problem in reading
 the binary file using binaryRead.
 
 Please let me know if I am going wrong anywhere.
 
 Thanks,
 Akhil
  
 
 
 
 
 jason hadoop wrote:
 
 I have only ever used the distributed cache to add files, including
 binary
 files such as shared libraries.
 It looks like you are adding a directory.
 
 The DistributedCache is not generally used for passing data, but for
 passing
 file names.
 The files must be stored in a shared file system (hdfs for simplicity)
 already.
 
 The distributed cache makes the names available to the tasks, and the
 files are extracted from HDFS and stored in the task-local work area on each
 tasktracker node.
 It looks like you may be storing the contents of your files in the
 distributed cache.
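
 For illustration, a minimal sketch of the pattern Jason describes: the file
 already lives in HDFS, the driver registers it with addCacheFile(), and the
 task finds the localized copy through getLocalCacheFiles(). The paths and
 names here are invented for the example:

 import java.io.IOException;
 import java.net.URI;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.JobConf;

 public class CacheFileSketch {

     // Driver side: register a file that is already stored in HDFS.
     public static void registerModelFile(JobConf conf) throws Exception {
         DistributedCache.addCacheFile(new URI("/user/akhil1988/models/model.bin"), conf);
     }

     // Task side (e.g. from Mapper.configure()): find the localized copy on this node.
     public static Path findLocalModelFile(JobConf conf) throws IOException {
         Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
         for (Path p : localFiles) {
             if (p.getName().equals("model.bin")) {
                 return p; // local filesystem path inside the tasktracker's work area
             }
         }
         throw new IOException("model.bin was not localized on this node");
     }
 }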
 
 On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote:
 

 Thanks Jason.

 I went inside the code of the statement and found out that it eventually
 makes some binaryRead function call to read a binary file and there it
 strucks.

 Do you know whether there is any problem in giving a binary file for
 addition to the distributed cache.
 In the statement DistributedCache.addCacheFile(new
 URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a directory
 which contains some text as well as some binary files. In the statement
 Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I
 can
 see(in the output messages) that it is able to read the text files but
 it
 gets struck at the binary files.

 So, I think here the problem is: it is not able to read the binary files
 which either have not been transferred to the cache or a binary file
 cannot
 be read.

 Do you know the solution to this?

 Thanks,
 Akhil


 jason hadoop wrote:
 
  Something is happening inside of your (Parameters.
  readConfigAndLoadExternalData(Config/allLayer1.config);)
  code, and the framework is killing the job for not heartbeating for
 600
  seconds
 
  On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com
 wrote:
 
 
  One more thing, finally it terminates there (after some time) by
 giving
  the
  final Exception:
 
  java.io.IOException: Job failed!
 at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
  at LbjTagger.NerTagger.main(NerTagger.java:109)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 
 
  akhil1988 wrote:
  
   Thank you Jason for your reply.
  
   My Map class is an inner class and it is a static class. Here is
 the
   structure of my code.
  
   public class NerTagger {
  
   public static class Map extends

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-17 Thread akhil1988

Thanks Jason.

I went inside the code of that statement and found out that it eventually
makes a binaryRead call to read a binary file, and it gets stuck there.

Do you know whether there is any problem in giving a binary file for
addition to the distributed cache?
In the statement DistributedCache.addCacheFile(new
URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf); Data is a directory
which contains some text as well as some binary files. In the statement
Parameters.readConfigAndLoadExternalData("Config/allLayer1.config"); I can
see (in the output messages) that it is able to read the text files, but it
gets stuck at the binary files.

So, I think the problem is that it is not able to read the binary files:
either they have not been transferred to the cache, or a binary file cannot
be read this way.

Do you know the solution to this?

Thanks,
Akhil


jason hadoop wrote:
 
 Something is happening inside of your
 Parameters.readConfigAndLoadExternalData("Config/allLayer1.config")
 code, and the framework is killing the job for not heartbeating for 600
 seconds.
 
 On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote:
 

 One more thing, finally it terminates there (after some time) by giving
 the
 final Exception:

 java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
 at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 akhil1988 wrote:
 
  Thank you Jason for your reply.
 
  My Map class is an inner class and it is a static class. Here is the
  structure of my code.
 
  public class NerTagger {
 
  public static class Map extends MapReduceBase implements
  MapperLongWritable, Text, Text, Text{
  private Text word = new Text();
  private static NETaggerLevel1 tagger1 = new
  NETaggerLevel1();
  private static NETaggerLevel2 tagger2 = new
  NETaggerLevel2();
 
  Map(){
  System.out.println(HI2\n);
 
  Parameters.readConfigAndLoadExternalData(Config/allLayer1.config);
  System.out.println(HI3\n);
 
  Parameters.forceNewSentenceOnLineBreaks=Boolean.parseBoolean(true);
 
  System.out.println(loading the tagger);
 
 
 tagger1=(NETaggerLevel1)Classifier.binaryRead(Parameters.pathToModelFile+.level1);
  System.out.println(HI5\n);
 
 
 tagger2=(NETaggerLevel2)Classifier.binaryRead(Parameters.pathToModelFile+.level2);
  System.out.println(Done- loading the tagger);
  }
 
  public void map(LongWritable key, Text value,
  OutputCollectorText, Text output, Reporter reporter ) throws
 IOException
  {
  String inputline = value.toString();
 
  /* Processing of the input pair is done here */
  }
 
 
  public static void main(String [] args) throws Exception {
  JobConf conf = new JobConf(NerTagger.class);
  conf.setJobName(NerTagger);
 
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
 
  conf.setMapperClass(Map.class);
  conf.setNumReduceTasks(0);
 
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
 
  conf.set(mapred.job.tracker, local);
  conf.set(fs.default.name, file:///);
 
  DistributedCache.addCacheFile(new
  URI(/home/akhil1988/Ner/OriginalNer/Data/), conf);
  DistributedCache.addCacheFile(new
  URI(/home/akhil1988/Ner/OriginalNer/Config/), conf);
  DistributedCache.createSymlink(conf);
 
 
  conf.set(mapred.child.java.opts,-Xmx4096m);
 
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new
 Path(args[1]));
 
  System.out.println(HI1\n);
 
  JobClient.runJob(conf);
  }
 
  Jason, when the program executes HI1 and HI2 are printed but it does
 not
  reaches HI3. In the statement
  Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); it
 is
  able to access Config/allLayer1.config file

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-17 Thread akhil1988

Hi Jason!

Thanks for going with me to solve my problem.

To restate things and make them easier to understand: I am working in
local mode, in the directory which contains the job jar and also the Config
and Data directories.

I just removed the following three statements from my code:
 DistributedCache.addCacheFile(new
 URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
 DistributedCache.addCacheFile(new
 URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
 DistributedCache.createSymlink(conf);

The program executes to the same point as before and then terminates.
That means the above three statements are of no use while working in local
mode. In local mode, the working directory for the mapreduce tasks becomes
the current working directory in which you started the hadoop command to
execute the job.

Since I have removed the DistributedCache.addCacheFile statements, there
should be no issue whether I am giving a file name or a directory name as the
argument to it. Now it seems to me that there is some problem in reading the
binary file using binaryRead.

Please let me know if I am going wrong anywhere.

Thanks,
Akhil
 




jason hadoop wrote:
 
 I have only ever used the distributed cache to add files, including binary
 files such as shared libraries.
 It looks like you are adding a directory.
 
 The DistributedCache is not generally used for passing data, but for
 passing
 file names.
 The files must be stored in a shared file system (hdfs for simplicity)
 already.
 
 The distributed cache makes the names available to the tasks, and the the
 files are extracted from hdfs and stored in the task local work area on
 each
 task tracker node.
 It looks like you may be storing the contents of your files in the
 distributed cache.
 
 On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote:
 

 Thanks Jason.

 I went inside the code of the statement and found out that it eventually
 makes some binaryRead function call to read a binary file and there it
 strucks.

 Do you know whether there is any problem in giving a binary file for
 addition to the distributed cache.
 In the statement DistributedCache.addCacheFile(new
 URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a directory
 which contains some text as well as some binary files. In the statement
 Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I
 can
 see(in the output messages) that it is able to read the text files but it
 gets struck at the binary files.

 So, I think here the problem is: it is not able to read the binary files
 which either have not been transferred to the cache or a binary file
 cannot
 be read.

 Do you know the solution to this?

 Thanks,
 Akhil


 jason hadoop wrote:
 
  Something is happening inside of your (Parameters.
  readConfigAndLoadExternalData(Config/allLayer1.config);)
  code, and the framework is killing the job for not heartbeating for 600
  seconds
 
  On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com
 wrote:
 
 
  One more thing, finally it terminates there (after some time) by
 giving
  the
  final Exception:
 
  java.io.IOException: Job failed!
 at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
  at LbjTagger.NerTagger.main(NerTagger.java:109)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 
 
  akhil1988 wrote:
  
   Thank you Jason for your reply.
  
   My Map class is an inner class and it is a static class. Here is the
   structure of my code.
  
   public class NerTagger {
  
   public static class Map extends MapReduceBase implements
   MapperLongWritable, Text, Text, Text{
   private Text word = new Text();
   private static NETaggerLevel1 tagger1 = new
   NETaggerLevel1();
   private static NETaggerLevel2 tagger2 = new
   NETaggerLevel2();
  
   Map(){
   System.out.println(HI2\n);
  
   Parameters.readConfigAndLoadExternalData(Config/allLayer1.config);
   System.out.println(HI3\n);
  
  
 Parameters.forceNewSentenceOnLineBreaks=Boolean.parseBoolean(true);
  
   System.out.println(loading the tagger);
  
  
 
 tagger1=(NETaggerLevel1)Classifier.binaryRead(Parameters.pathToModelFile+.level1);
   System.out.println(HI5\n);
  
  
 
 tagger2

Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread akhil1988

Hi All,

I am running my mapred program in local mode by setting
mapred.job.tracker to local so that I can debug my code.
The mapred program is a direct port of my original sequential code. There
is no reduce phase.
Basically, I have just put my program in the map class.

My program takes around 1-2 minutes to instantiate the data objects, which is
done in the constructor of the Map class (it loads some data model files,
so it takes some time). After the instantiation part in the
constructor of the Map class, the map function is supposed to process the
input split.

The problem is that the data objects do not get instantiated completely, and
in between (while it is still in the constructor) the program stops, giving
the exceptions pasted at the bottom.
The program runs fine without mapreduce and does not require more than 2GB of
memory, but in mapreduce, even after doing export HADOOP_HEAPSIZE=2500 (I am
working on machines with 16GB RAM), the program fails. I have also set
HADOOP_OPTS="-server -XX:-UseGCOverheadLimit" as sometimes I was getting GC
Overhead Limit Exceeded exceptions as well.

Somebody, please help me with this problem: I have been trying to debug it
for the last 3 days without success. Thanks!

java.lang.OutOfMemoryError: Java heap space
at sun.misc.FloatingDecimal.toJavaFormatString(FloatingDecimal.java:889)
at java.lang.Double.toString(Double.java:179)
at java.text.DigitList.set(DigitList.java:272)
at java.text.DecimalFormat.format(DecimalFormat.java:584)
at java.text.DecimalFormat.format(DecimalFormat.java:507)
at java.text.NumberFormat.format(NumberFormat.java:269)
at 
org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:110)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1147)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

09/06/16 12:34:41 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:79)
... 5 more
Caused by: java.lang.ThreadDeath
at java.lang.Thread.stop(Thread.java:715)
at 
org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
at
org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1224)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

-- 
View this message in context: 
http://www.nabble.com/Nor-%22OOM-Java-Heap-Space%22-neither-%22GC-OverHead-Limit-Exeeceded%22-tp24059508p24059508.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread akhil1988

One more thing: finally it terminates there (after some time) by giving this
final exception:

java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


akhil1988 wrote:
 
 Thank you Jason for your reply. 
 
 My Map class is an inner class and it is a static class. Here is the
 structure of my code.
 
 public class NerTagger {
 
 public static class Map extends MapReduceBase implements
 Mapper<LongWritable, Text, Text, Text> {
 private Text word = new Text();
 private static NETaggerLevel1 tagger1 = new
 NETaggerLevel1();
 private static NETaggerLevel2 tagger2 = new
 NETaggerLevel2();
 
 Map(){
 System.out.println("HI2\n");
 Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
 System.out.println("HI3\n");
 Parameters.forceNewSentenceOnLineBreaks=Boolean.parseBoolean("true");
 System.out.println("loading the tagger");
 tagger1=(NETaggerLevel1)Classifier.binaryRead(Parameters.pathToModelFile+".level1");
 System.out.println("HI5\n");
 tagger2=(NETaggerLevel2)Classifier.binaryRead(Parameters.pathToModelFile+".level2");
 System.out.println("Done- loading the tagger");
 }
 
 public void map(LongWritable key, Text value,
 OutputCollector<Text, Text> output, Reporter reporter) throws IOException
 {
 String inputline = value.toString();
 
 /* Processing of the input pair is done here */
 }
 
 
 public static void main(String [] args) throws Exception {
 JobConf conf = new JobConf(NerTagger.class);
 conf.setJobName("NerTagger");
 
 conf.setOutputKeyClass(Text.class);
 conf.setOutputValueClass(IntWritable.class);
 
 conf.setMapperClass(Map.class);
 conf.setNumReduceTasks(0);
 
 conf.setInputFormat(TextInputFormat.class);
 conf.setOutputFormat(TextOutputFormat.class);
 
 conf.set("mapred.job.tracker", "local");
 conf.set("fs.default.name", "file:///");
 
 DistributedCache.addCacheFile(new
 URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
 DistributedCache.addCacheFile(new
 URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
 DistributedCache.createSymlink(conf);
 
 
 conf.set("mapred.child.java.opts", "-Xmx4096m");
 
 FileInputFormat.setInputPaths(conf, new Path(args[0]));
 FileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
 System.out.println("HI1\n");
 
 JobClient.runJob(conf);
 }
 
 Jason, when the program executes, HI1 and HI2 are printed but it never
 reaches HI3. In the statement
 Parameters.readConfigAndLoadExternalData("Config/allLayer1.config"); it is
 able to access the Config/allLayer1.config file (while executing this
 statement, it prints some messages about which data it is loading, etc.),
 but it gets stuck there (while loading some classifier) and never reaches
 HI3.
 
 This program runs fine when executed normally (without mapreduce).
 
 Thanks, Akhil
 
 
 
 
 jason hadoop wrote:
 
 Is it possible that your map class is an inner class and not static?
 
 On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote:
 

 Hi All,

 I am running my mapred program in local mode by setting
 mapred.jobtracker.local to local mode so that I can debug my code.
 The mapred program is a direct porting of my original sequential code.
 There
 is no reduce phase.
 Basically, I have just put my program in the map class.

 My program takes around 1-2 min. in instantiating the data objects which
 are
 present in the constructor of Map class(it loads some data model files,
 therefore it takes some time). After the instantiation part in the
 constrcutor of Map class the map function is supposed to process the
 input
 split.

 The problem is that the data objects do not get instantiated

Re: Implementing CLient-Server architecture using MapReduce

2009-06-08 Thread akhil1988


Can anyone help me with this issue? I have an account on the cluster, and I
cannot go and start a server process on each tasktracker myself.

Akhil

akhil1988 wrote:
 
 Hi All,
 
 I am porting a machine learning application on Hadoop using MapReduce. The
 architecture of the application goes like this: 
 1. run a number of server processes which take around 2-3 minutes to start
 and then remain as daemon waiting for a client to call for a connection.
 During the startup these server processes get trained on the trainng
 dataset.
 
 2. A client is then run which connects to servers and process or test any
 data that it wants to. The client is basically our job, which we will be
 converted to the mapreduce model of hadoop.
 
 Now, since each server takes a good amount of time to start, needless to
 say that we want each of these server processes to be pre-running on all
 the tasktrackers(all nodes) so that when a mapreduce(client) task come to
 that node, the servers are already running and the client just uses them
 for its purpose. The server process keeps on running waiting for another
 map task that may be assigned to that node.
 
 
 That means, a server process is started on each node once and it waits for
 a connection by a client. When clients( implemeted as map reduce) come to
 that node they connect to the server, do they their processing and
 leave(or finish).
 
 Can you please tell me how should I go about starting the server on each
 node. If I am not clear, please ask any questions. Any help in this regard
 will be greatly appreciated.
 
 Thank You!
 Akhil
 
 

-- 
View this message in context: 
http://www.nabble.com/Implementing-CLient-Server-architecture-using-MapReduce-tp23916757p23928505.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Implementing CLient-Server architecture using MapReduce

2009-06-07 Thread akhil1988

Hi All,

I am porting a machine learning application to Hadoop using MapReduce. The
architecture of the application goes like this:
1. Run a number of server processes which take around 2-3 minutes to start
and then remain as daemons waiting for a client to ask for a connection.
During startup these server processes get trained on the training
dataset.

2. A client is then run which connects to the servers and processes or tests
any data that it wants to. The client is basically our job, which will be
converted to the MapReduce model of Hadoop.

Now, since each server takes a good amount of time to start, needless to say
we want each of these server processes to be pre-running on all the
tasktrackers (all nodes), so that when a mapreduce (client) task comes to that
node, the servers are already running and the client just uses them for its
purpose. The server process keeps running, waiting for another map task
that may be assigned to that node.


That means a server process is started on each node once, and it waits for a
connection from a client. When clients (implemented as map tasks) come to that
node, they connect to the server, do their processing, and leave (or
finish).
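
Purely as an illustration of the client side of this design, here is a sketch
of a map task that talks to a server daemon assumed to be already listening on
every node (the port number and the line-based protocol are invented; how the
daemons themselves get started is exactly the open question below):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ServerClientMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Socket socket;
    private PrintWriter toServer;
    private BufferedReader fromServer;

    @Override
    public void configure(JobConf conf) {
        try {
            // The trained server daemon is assumed to be listening on this node already.
            socket = new Socket("localhost", 5555);
            toServer = new PrintWriter(socket.getOutputStream(), true);
            fromServer = new BufferedReader(new InputStreamReader(socket.getInputStream()));
        } catch (IOException e) {
            throw new RuntimeException("Could not connect to local server daemon", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Send one record to the local server and emit whatever it answers.
        toServer.println(value.toString());
        String answer = fromServer.readLine();
        output.collect(value, new Text(answer == null ? "" : answer));
    }

    @Override
    public void close() throws IOException {
        socket.close();
    }
}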

Can you please tell me how I should go about starting the server on each
node? If I am not clear, please ask any questions. Any help in this regard
will be greatly appreciated.

Thank You!
Akhil

-- 
View this message in context: 
http://www.nabble.com/Implementing-CLient-Server-architecture-using-MapReduce-tp23916757p23916757.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Giving classpath in hadoop jar command i.e. while executing a mapreduce job

2009-06-05 Thread akhil1988

I wish to pass the path of a jar file as an argument when executing the
hadoop jar ... command, as my mapper uses that jar file for its operation. I
found that the -libjars option can be used, but it is not working for me; it
gives an exception. Can anyone tell me how to use the -libjars generic command
option, and whether any change needs to be made in the code while using
-libjars?
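
For reference, the -libjars option is handled by the GenericOptionsParser, so
the driver generally has to implement Tool and be launched through ToolRunner;
a minimal sketch follows (class, jar, and job names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // getConf() already contains whatever -libjars / -D options ToolRunner parsed.
        JobConf conf = new JobConf(getConf(), MyJobDriver.class);
        conf.setJobName("my-job");
        // ... set mapper class and input/output paths from args, etc. ...
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-libjars, -D, -files, ...) before
        // handing the remaining arguments to run().
        int exitCode = ToolRunner.run(new Configuration(), new MyJobDriver(), args);
        System.exit(exitCode);
    }
}

The invocation would then look something like:
hadoop jar myjob.jar MyJobDriver -libjars /path/to/dependency.jar input output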

Thank You in Advance!
Akhil Langer

-- 
View this message in context: 
http://www.nabble.com/Giving-classpath-in-%22hadoop-jar%22-command-i.e.-while-executing-a-mapreduce-job-tp23893761p23893761.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Processing files lying in a directory structure

2009-06-04 Thread akhil1988

Hi! 

I am working on applying the WordCount example to the entire Wikipedia dump.
The entire English Wikipedia is around 200GB, which I have stored in HDFS on a
cluster to which I have access.
The problem: the Wikipedia dump contains many directories (it has a very big
directory structure) containing HTML files, but FileInputFormat requires
all the files to be processed to be present in a single directory.

Can anybody give me an idea, or point to something that already exists, for
applying WordCount to these HTML files in the directories without changing
the directory structure?
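
One possible approach, sketched below under the assumption that the dump is
readable through the FileSystem API: walk the directory tree and register
every plain file as an input path, leaving the directory structure untouched
(method and path names are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class RecursiveInputs {

    // Walk the tree under root and register every plain file as a job input path.
    public static void addInputsRecursively(JobConf conf, Path root) throws IOException {
        FileSystem fs = root.getFileSystem(conf);
        FileStatus[] entries = fs.listStatus(root);
        if (entries == null) {
            return; // path does not exist or is empty
        }
        for (FileStatus status : entries) {
            if (status.isDir()) {
                addInputsRecursively(conf, status.getPath());
            } else {
                FileInputFormat.addInputPath(conf, status.getPath());
            }
        }
    }
}

Usage in the driver, before submitting the job, would look something like:
addInputsRecursively(conf, new Path("/user/akhil1988/wikipedia-dump"));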

Akhil
-- 
View this message in context: 
http://www.nabble.com/Processing-files-lying-in-a-directory-structure-tp23875340p23875340.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.