Re: MapReduce job temp input files
Hi,

I also see this on the web UI: "Number of Blocks Pending Deletion: 1". How can I delete the invalidated blocks immediately, without restarting the cluster?

Thanks,
Tang

On 2014/10/29 13:11:28, Tang shawndow...@gmail.com wrote:

Hi,

We are running MapReduce jobs on Hadoop clusters. The job inputs come from logs which are not in HDFS, so we first have to copy them into HDFS, then delete them after the job finishes. Recently the cluster has become very unstable: the HDFS disks tend to fill up even though the total of valid files is only a few gigabytes, because many invalid blocks remain on disk. After rebooting the whole cluster they are deleted automatically. Restarting only the datanodes does not work; the namenode will not send the delete-block commands to the datanodes. Any ideas for this case?

Regards,
Tang
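For reference, a minimal sketch of the copy-in/run/delete workflow described above, using the Java FileSystem API. The paths are hypothetical placeholders, and the job submission itself is stubbed out as a comment:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StageLogsAndClean {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path local = new Path("/var/log/app");      // hypothetical local log dir
            Path staging = new Path("/tmp/job-input");  // hypothetical HDFS staging dir

            // Copy the logs into HDFS before the job runs
            // (delSrc=false, overwrite=true).
            fs.copyFromLocalFile(false, true, local, staging);
            try {
                // ... submit and wait for the MapReduce job here ...
            } finally {
                // Delete the staged input when the job finishes. The namenode
                // marks the blocks invalid and schedules them for deletion on
                // the datanodes' subsequent heartbeats.
                fs.delete(staging, true);
            }
        }
    }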
[HDFS] result order of getFileBlockLocations() and listFiles()?
Hi guys,

I am trying to implement a simple program (experimental, not for production). It invokes FileSystem.listFiles() to get the list of files under an HDFS folder, then uses FileSystem.getFileBlockLocations() to get the replica locations of each file's blocks. Since it is a controlled environment, I can make sure the files are static, and I am not worried about datanode crashes, failover, etc.

Assuming that within a small time window (say, one minute) hundreds to thousands of clients invoke the same program to look up the same folder, will these two APIs guarantee the *same result in the same order* for all clients?

To elaborate: say a folder called /dfs/dn/user/data contains three files: file1, file2, and file3, and client1 gets:

    listFiles(): file1, file2, file3
    getFileBlockLocations(file1): datanode1, datanode3, datanode6

Will all the other clients get the same information (I think so) and in the same order? Or does each client have to sort the results to guarantee the order?

Many thanks for your input,
Demai
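A minimal sketch of the lookup described above, with an explicit client-side sort, since neither API documents an ordering guarantee. The folder path is taken from the example; everything else is an assumption:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/dfs/dn/user/data");  // folder from the example

            // Drain the iterator into a list, then sort by path so every
            // client sees the files in the same order.
            List<LocatedFileStatus> files = new ArrayList<>();
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, false);
            while (it.hasNext()) {
                files.add(it.next());
            }
            files.sort(Comparator.comparing(f -> f.getPath().toString()));

            for (FileStatus f : files) {
                BlockLocation[] blocks = fs.getFileBlockLocations(f, 0, f.getLen());
                for (BlockLocation b : blocks) {
                    // getHosts() lists the datanodes holding each replica;
                    // sort here too if a stable replica order matters.
                    System.out.println(f.getPath() + " -> "
                        + String.join(",", b.getHosts()));
                }
            }
        }
    }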
Re: run arbitrary job (non-MR) on YARN?
You can accomplish this with the DistributedShell application that ships with YARN. If you copy all your archives to HDFS, then inside your shell script you can copy those archives into your YARN container and execute whatever you want, provided all the other system dependencies exist in the container (the correct Java version, Python, C++ libraries, etc.).

For example, in myscript.sh I wrote the following:

    #!/usr/bin/env bash
    echo "This is my script running!"
    echo "Present working directory:"
    pwd
    echo "Current directory listing (nothing exciting yet):"
    ls
    echo "Copying file from HDFS to container"
    hadoop fs -get /path/to/some/data/on/hdfs .
    echo "Current directory listing (file should now be here):"
    ls
    echo "Cat ExecScript.sh (the script created by the DistributedShell application):"
    cat ExecScript.sh

Run the DistributedShell application with the hadoop (or yarn) command:

    hadoop org.apache.hadoop.yarn.applications.distributedshell.Client \
        -jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.3.jar \
        -num_containers 1 \
        -shell_script myscript.sh

If you have the YARN log aggregation property set, you can pipe the container's logs to your client console using the yarn command (replace the application id with yours):

    yarn logs -applicationId application_1414160538995_0035

Here is a quick reference that should help get you going:
http://books.google.com/books?id=heoXAwAAQBAJ&pg=PA227&lpg=PA227&dq=hadoop+yarn+distributed+shell+application&source=bl&ots=psGuJYlY1Y&sig=khp3b3hgzsZLZWFfz7GOe2yhgyY&hl=en&sa=X&ei=0U5RVKzDLeTK8gGgoYGoDQ&ved=0CFcQ6AEwCA#v=onepage&q&f=false

Hopefully this helps,
Kevin

On Mon, Oct 27, 2014, at 2:21:18 AM, Yang tedd...@gmail.com wrote:

I happened to run into this interesting scenario: I had some Mahout seq2sparse jobs which I originally ran in parallel in distributed mode. Because the input files are so small, running them locally is actually much faster, so I turned them to local mode. But I run ten of these jobs in parallel, and when ten Mahout jobs run together, everything becomes very slow.

Is there existing code that takes a desired shell script, and possibly some archive files (which could contain the jar file, or C++-generated executable code)? I understand that I could use the YARN API to code such a thing, but it would be nice if I could just take something and run it from the shell.

Thanks,
Yang
Fwd: problems with Hadoop installation
All,

I am new to Hadoop, so any help would be appreciated. I have installed the most recent stable version (2.4.1) on a virtual machine running CentOS 7, and I have tried to run the command hadoop fs -ls, but without success.

The question is: what does Hadoop consider a valid JAVA_HOME directory, and where should the JAVA_HOME variable be defined? I installed the most recent version of Java using the package manager yum, as detailed below.

This is in my .bashrc file:

    # The java implementation to use.
    export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64

    [david@localhost ~]$ hadoop fs -ls
    /usr/local/hadoop/bin/hadoop: line 133: /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java: No such file or directory

Then I tried the value /usr/bin/java for JAVA_HOME in my .bashrc file:

    [david@localhost ~]$ which java
    /usr/bin/java
    [david@localhost ~]$ java -version
    java version "1.7.0_71"
    OpenJDK Runtime Environment (rhel-2.5.3.1.el7_0-x86_64 u71-b14)
    OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Here is the result:

    [david@localhost ~]$ hadoop fs -ls
    /usr/local/hadoop/bin/hadoop: line 133: /usr/bin/java/bin/java: Not a directory
    /usr/local/hadoop/bin/hadoop: line 133: exec: /usr/bin/java/bin/java: cannot execute: Not a directory

David Novogrodsky
Re: problems with Hadoop installation
Are RHEL7-based OSs supported?

On Wed, Oct 29, 2014 at 3:59 PM, David Novogrodsky david.novogrod...@gmail.com wrote:

[snip]
Re: problems with Hadoop installation
Hi David,

JAVA_HOME should point to the Java installation directory. Typically this directory contains a subdirectory called 'bin', and Hadoop tries to find the java command at $JAVA_HOME/bin/java. It is likely that /usr/bin/java is a symlink to some other file; if you do an ls -l /usr/bin/java, you should be able to see where that symlink points. If the symlink points to a path of the form base_dir/bin/java, then base_dir should be the value of JAVA_HOME.

HTH,
Bhooshan

On Wed, Oct 29, 2014 at 3:59 PM, David Novogrodsky david.novogrod...@gmail.com wrote:

[snip]
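To make the symlink chase concrete, here is a small illustrative Java snippet that resolves /usr/bin/java and derives the base_dir mentioned above (from the shell, ls -l or readlink -f does the same job; the example path in the comment is an assumption):

    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FindJavaHome {
        public static void main(String[] args) throws Exception {
            // Follow every symlink to the real java binary, e.g.
            // /usr/bin/java -> /usr/lib/jvm/java-1.7.0-openjdk.x86_64/jre/bin/java
            Path real = Paths.get("/usr/bin/java").toRealPath();

            // If the real path has the form base_dir/bin/java, then
            // base_dir is what JAVA_HOME should be set to.
            Path javaHome = real.getParent().getParent();
            System.out.println("export JAVA_HOME=" + javaHome);
        }
    }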
RE: problems with Hadoop installation
Try adding "/" at the end of hadoop fs -ls, so it becomes:

    hadoop fs -ls /

From: David Novogrodsky [mailto:david.novogrod...@gmail.com]
Sent: Thursday, October 30, 2014 7:00 AM
To: user@hadoop.apache.org
Subject: Fwd: problems with Hadoop installation

[snip]