You said that the mappers seem to be accessing files sequentially. What makes 
you think so?
NFS may change some behavior, but mappers shouldn't access files sequentially. 
NFS could make the files unsplittable, but you would need more testing to 
verify that.
The class you want to check out is 
org.apache.hadoop.mapred.FileInputFormat, especially the getSplits() method.
That code is the key to how the split list is generated. If it doesn't 
perform well for your underlying storage system, you can always write your 
own InputFormat to take advantage of your storage system.
Yong
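
To make the getSplits() discussion a bit more concrete, here is a minimal, Hadoop-free sketch of how one input file might be carved into splits. It is a simplified model written from memory of the old-API FileInputFormat logic, not the Hadoop source itself. The class name SplitSketch, the method splitsFor, and the sizes used in main are assumptions for illustration only; please check the source of your Hadoop version for the real behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, Hadoop-free model of per-file split computation (illustrative only).
final class SplitSketch {
    // The last split may be up to 10% larger than the split size before a new one is cut.
    static final double SPLIT_SLOP = 1.1;

    // Split size is capped by the block size and floored by the configured minimum.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns {offset, length} pairs that cover a file of the given length.
    static List<long[]> splitsFor(long fileLength, long splitSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // An unsplittable file is handed to a single mapper as one split.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;   // assumed block size of the storage layer
        long fileLength = 100L * 1024 * 1024; // assumed size of one input file
        long goalSize = fileLength;           // total input size / requested maps (1 assumed)
        long splitSize = computeSplitSize(goalSize, 1, blockSize);
        for (long[] s : splitsFor(fileLength, splitSize, true)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

In this model the splits for all input files are computed up front, before any mapper starts, so the order in which files are actually opened depends on how the resulting tasks get scheduled rather than on the split computation.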

From: atish.kath...@gmail.com
Date: Wed, 8 Jan 2014 15:48:12 +0530
Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem
To: user@hadoop.apache.org

Figured out 1. The output of the reduce was going to the slave node, while I 
was looking for it in the master node. Which is perfectly fine. Need guidance 
for 2. though!


Thanks
Atish

On Wed, Jan 8, 2014 at 3:30 PM, Atish Kathpal <atish.kath...@gmail.com> wrote:



Hi
By giving the complete URI, the MR jobs worked across both nodes. Thanks a lot 
for the advice. 



Two issues though:
1. On completion of the MR job, I see only the "_SUCCESS" 
file in the output directory, but no part-r file containing the actual results 
of the wordcount job. However, I do see the correct output when running MR over 
HDFS. What is going wrong? Is there any place I can find logs for the MR job? 
I see no errors on the console.



Command used: hadoop jar 
/home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar 
wordcount file:///home/hduser/testmount/ file:///home/hduser/testresults/





2. I am observing that the mappers seem to be accessing files sequentially: 
splitting one file across mappers, reading its data in parallel, and then 
moving on to the next file. What I want instead is for the files themselves 
to be accessed in parallel. That is, if there are 10 files to be MRed, then 
MR should request all of these files in one go, and then work on 
the splits of these files in parallel.



Why do I need this? Some of the data served from the NFS mount point comes 
from offline media (which take ~5-10 seconds before the first bytes are 
received). So I would like all required files to be requested from the NFS 
mount point at the outset. This way several offline media will be spun up 
in parallel, and MR can process the data from these media as it becomes available.
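
One way to approximate the "request everything at the outset" behavior described above, outside of MapReduce itself, would be a small warm-up step run before the job is submitted. The sketch below is only an illustration under stated assumptions: it assumes that reading the first byte of a file is enough to make the storage backend start recalling it, which depends on the NAS and on NFS client caching. The class name PrefetchSketch and the method touchAll are hypothetical, and this is not a Hadoop feature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical warm-up helper: opens all input files concurrently (illustrative only).
final class PrefetchSketch {
    // Reads one byte from every regular file directly under dir, using a thread pool,
    // and returns the number of files touched.
    static int touchAll(Path dir, int threads) {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<Integer> task = () -> {
                    // Assumption: the first read is what triggers the media spin-up.
                    try (InputStream in = Files.newInputStream(file)) {
                        return in.read();
                    }
                };
                pending.add(pool.submit(task));
            }
            for (Future<Integer> f : pending) {
                f.get(); // wait until every file has returned its first byte
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return files.size();
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("touched " + touchAll(dir, 10) + " files under " + dir);
    }
}
```

It could be pointed at the job's input directory (for example /home/hduser/testmount/) just before the hadoop jar command is issued. Whether the recalled data is still readily available by the time the mappers ask for it is something that would need to be tested on the actual storage system.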




Would be glad to get inputs on these points!
Thanks
Atish

Tip for those who are trying similar stuff: in my case, after a while the jobs 
would fail, complaining of "java.lang.OutOfMemoryError: Java heap space", but I 
was able to rectify this with help from: 
http://stackoverflow.com/questions/13674190/cdh-4-1-error-running-child-java-lang-outofmemoryerror-java-heap-space

On Sun, Dec 22, 2013 at 2:47 PM, Atish Kathpal <atish.kath...@gmail.com> wrote:




Thanks Devin, Yong, and Chris for your replies and suggestions. I will test the 
suggestions made by Yong and Devin and get back to you guys.




As for the bottlenecking issue, I agree, but I am trying to run a few MR jobs on 
a traditional NAS server. I can live with a few bottlenecks, so long as I don't 
have to move the data to a dedicated HDFS cluster.






On Sat, Dec 21, 2013 at 8:06 AM, Chris Mawata <chris.maw...@gmail.com> wrote:






  
    
  
  
    Yong raises an important issue:  You
      have thrown out the I/O advantages of HDFS and also thrown out the
      advantages of data locality. It would be interesting to know why
      you are taking this approach.

      Chris

      

      On 12/20/2013 9:28 AM, java8964 wrote:

    
    
      
      I believe the "-fs local" should be removed too.
        The reason is that even if you have a dedicated JobTracker after
        removing "-jt local", with "-fs local" I believe that all
        the mappers will be run sequentially.
        

        
        "-fs local" will force MapReduce to run in "local" mode,
          which is really a test mode.
        

        
        What you can do is remove both "-fs local" and "-jt local",
          but give the FULL URI of the input and output paths, to tell
          Hadoop that they are on the local filesystem instead of HDFS.
        

        
        "hadoop jar
          /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
          wordcount file:///hduser/mount_point
          file:///results"
        

        
        Keep the following in mind:

        1) The NFS mount needs to be available on all your task
          nodes, and mounted in the same way.
        2) Even if you can do that, your shared storage will be
          your bottleneck. NFS won't scale well.
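
The reason the full URI matters can be illustrated with a tiny sketch. A path without a scheme is resolved against the cluster's default filesystem (fs.defaultFS), while a path that starts with file:/// names the local filesystem explicitly. The code below only uses java.net.URI to show the difference; the class name SchemeSketch, the method effectiveScheme, and the "hdfs" default are assumptions for illustration, not Hadoop code.

```java
import java.net.URI;

// Illustrative only: shows which filesystem scheme a path string would select.
final class SchemeSketch {
    // Returns the scheme embedded in the path, or the default when there is none.
    static String effectiveScheme(String path, String defaultScheme) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : defaultScheme;
    }

    public static void main(String[] args) {
        String clusterDefault = "hdfs"; // assumed default filesystem of the cluster
        // Explicit scheme: the local (NFS-mounted) path is used.
        System.out.println(effectiveScheme("file:///hduser/mount_point", clusterDefault));
        // No scheme: the path falls back to the cluster default.
        System.out.println(effectiveScheme("/hduser/mount_point", clusterDefault));
    }
}
```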
        

        
        Yong

          

          
            Date: Fri, 20 Dec 2013 09:01:32 -0500
            Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted filesystem
            From: dsui...@rdx.com
            To: user@hadoop.apache.org

            

            I think most of your problem is coming from
              the options you are setting:
              

              
              "hadoop jar
                /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
                wordcount -fs local -jt local
                /hduser/mount_point/  /results"

              
              

              
              You appear to be directing your namenode to run jobs
                in the LOCAL job runner and directing it to read
                from the LOCAL filesystem. Drop the -jt
                argument and it should run in distributed mode if your
                cluster is set up right. You don't need to do anything
                special to point Hadoop towards an NFS location, other
                than set up the NFS location properly and make sure
                that, if you are referring to it by name, the name
                resolves to the right address. Hadoop doesn't care
                where it is, as long as it can read from and write to
                it. The fact that you are telling it to read/write
                from/to an NFS location that happens to be mounted as a
                local filesystem object doesn't matter. You could
                direct it to the local /hduser/ path and set the
                -fs local option, and it would end up on the NFS
                mount, because that's where the NFS mount actually
                exists, or you could direct it to the absolute network
                location of the folder that you want. It shouldn't
                make a difference.
            
            
              
                Devin Suiter
                  
                    Jr. Data Solutions Software Engineer
                    
                      
                       100 Sandusky Street | 2nd Floor |
                        Pittsburgh, PA 15212

                        Google Voice: 412-256-8556 | www.rdx.com
                    
                  
                
              
              

              

              On Fri, Dec 20, 2013 at 5:27 AM, Atish Kathpal <atish.kath...@gmail.com> wrote:

                
                  Hello 
                    

                    
                    The picture below describes the deployment
                      architecture I am trying to achieve. 
                    However, when I run the wordcount example code
                      with the below configuration, by issuing the
                      command from the master node, I notice only the
                      master node spawning map tasks and completing the
                      submitted job. Below is the command I used:
                    

                    
                    
                      hadoop jar
                          /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
                          wordcount -fs local -jt local
                          /hduser/mount_point/  /results
                    
                    

                    
                    Question: How can I leverage both Hadoop
                        nodes for running MR, while serving my data from
                        the common NFS mount point, with my filesystem
                        running at the backend? Has anyone tried such a
                        setup before?
                    

                    
                    

                    
                    Thanks!
                  
                
              
              

            
          
        
      
    
    

  






                                          

<<inline: ATT00001>>
