Re: Decommissioning Nodes
Hey Alyssa,

If one of those datanodes goes down, a few minutes will pass before the master discovers it. The master treats nodes that have not sent a heartbeat for a while as dead.

On Thu, Jan 22, 2009 at 8:34 AM, Hargraves, Alyssa wrote:
> Hello Hadoop Users,
>
> I was hoping someone would be able to answer a question about node decommissioning. I have a test Hadoop cluster set up which only consists of my computer and a master node. I am looking at the removal and addition of nodes. Adding a node is nearly instant (only about 5 seconds), but removing a node by decommissioning it takes a while, and I don't understand why. Currently, the systems are running no map/reduce tasks and storing no data. DFS Health reports:
>
> 7 files and directories, 0 blocks = 7 total. Heap Size is 6.68 MB / 992.31 MB (0%)
> Capacity: 298.02 GB
> DFS Remaining: 245.79 GB
> DFS Used: 4 KB
> DFS Used%: 0 %
> Live Nodes: 2
> Dead Nodes: 0
>
> Node    Last Contact  Admin State               Size (GB)  Used (%)  Remaining (GB)  Blocks
> master  0             In Service                149.01     0         122.22          0
> slave   82            Decommission In Progress  149.01     0         123.58          0
>
> However, even with nothing stored and nothing running, the decommission process takes 3 to 5 minutes, and I'm not quite sure why. There isn't any data to move anywhere, and there aren't any jobs to worry about. I am using 0.18.2.
>
> Thank you for any help in solving this,
> Alyssa Hargraves

--
My research interests are distributed systems, parallel computing and bytecode-based virtual machines.
http://coderplay.javaeye.com
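For reference, the usual decommission sequence on this version looks roughly like the sketch below; the exclude-file path is only an example:

    # conf/hadoop-site.xml: point the namenode at an exclude file
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/home/hadoop/conf/excludes</value>
    </property>

    # list the host to decommission, then make the namenode re-read it
    echo slave >> /home/hadoop/conf/excludes
    bin/hadoop dfsadmin -refreshNodes

The namenode then re-replicates the node's blocks elsewhere and marks it "Decommissioned" when done; with zero blocks stored, the delay Alyssa sees is most likely just the namenode's periodic re-check of decommission status rather than any data movement.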
Decommissioning Nodes
Hello Hadoop Users,

I was hoping someone would be able to answer a question about node decommissioning. I have a test Hadoop cluster set up which only consists of my computer and a master node. I am looking at the removal and addition of nodes. Adding a node is nearly instant (only about 5 seconds), but removing a node by decommissioning it takes a while, and I don't understand why. Currently, the systems are running no map/reduce tasks and storing no data. DFS Health reports:

7 files and directories, 0 blocks = 7 total. Heap Size is 6.68 MB / 992.31 MB (0%)
Capacity: 298.02 GB
DFS Remaining: 245.79 GB
DFS Used: 4 KB
DFS Used%: 0 %
Live Nodes: 2
Dead Nodes: 0

Node    Last Contact  Admin State               Size (GB)  Used (%)  Remaining (GB)  Blocks
master  0             In Service                149.01     0         122.22          0
slave   82            Decommission In Progress  149.01     0         123.58          0

However, even with nothing stored and nothing running, the decommission process takes 3 to 5 minutes, and I'm not quite sure why. There isn't any data to move anywhere, and there aren't any jobs to worry about. I am using 0.18.2.

Thank you for any help in solving this,
Alyssa Hargraves
Re: using distcp for http source files
Derek Young wrote:
> Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this should be supported, but the http URLs are not working for me. Are http source URLs still supported?

No. They used to be supported, but when distcp was converted to accept any Path this stopped working, since there is no FileSystem implementation mapped to http: paths.

Implementing an HttpFileSystem that supports read-only access to files and no directory listings is fairly trivial, but without directory listings, distcp would not work well.

https://issues.apache.org/jira/browse/HADOOP-1563 includes a now long-stale patch that implements an HTTP filesystem, where directory listings are implemented, assuming that:

- directories are represented by slash-terminated URLs;
- GET of a directory contains the URLs of its children.

This works for the directory listings returned by many HTTP servers. Perhaps someone can update this patch, and, if folks find it useful, we can include it.

Doug
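As a rough illustration of the read-only piece Doug describes, the seek support such a filesystem needs can be faked over plain HTTP by re-opening the URL with a Range header. The class below is a hypothetical sketch against the FileSystem API of this era, not the HADOOP-1563 patch; an FSDataInputStream can wrap it directly, and the remaining FileSystem methods would be read-only stubs.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.hadoop.fs.FSInputStream;

    // Hypothetical: a read-only, seekable stream over a static HTTP resource.
    // Seeks close the connection and re-request the file from the new offset.
    class HttpInputStream extends FSInputStream {
      private final URL url;
      private InputStream in;
      private long pos;

      HttpInputStream(URL url) throws IOException {
        this.url = url;
        this.in = openAt(0);
      }

      private InputStream openAt(long offset) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (offset > 0) {
          // assumes the server honors HTTP/1.1 Range requests
          conn.setRequestProperty("Range", "bytes=" + offset + "-");
        }
        conn.connect();
        pos = offset;
        return conn.getInputStream();
      }

      @Override
      public synchronized void seek(long target) throws IOException {
        if (target != pos) {
          in.close();
          in = openAt(target);
        }
      }

      @Override
      public synchronized long getPos() throws IOException {
        return pos;
      }

      @Override
      public boolean seekToNewSource(long target) throws IOException {
        return false; // only one source for a plain URL
      }

      @Override
      public synchronized int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
          pos++;
        }
        return b;
      }

      @Override
      public void close() throws IOException {
        in.close();
      }
    }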
RE: Hadoop User Group Meeting (Bay Area) 1/21
Reminder - the Bay Area Hadoop User Group meeting is today at 6 pm.

From: Ajay Anand
Sent: Thursday, January 08, 2009 12:10 PM
To: 'core-user@hadoop.apache.org'; 'gene...@hadoop.apache.org'; 'zookeeper-u...@hadoop.apache.org'; 'hbase-u...@hadoop.apache.org'; pig-u...@hadoop.apache.org
Subject: Hadoop User Group Meeting (Bay Area) 1/21

The next Bay Area Hadoop User Group meeting is scheduled for Wednesday, January 21st at Yahoo! 2821 Mission College Blvd, Santa Clara, Building 1, Training Rooms 3 & 4, from 6:00-7:30 pm.

Agenda:
- Hadoop 0.20 Overview - Sameer Paranjpye
- Hadoop 1.0 discussion - Sanjay Radia

Please send me any other suggestions for topics for future meetings.

Registration: http://upcoming.yahoo.com/event/1478614

Look forward to seeing you there!
Ajay
running hadoop on heterogeneous hardware
Is hadoop designed to run on homogeneous hardware only, or does it work just as well on heterogeneous hardware? If the datanodes have different disk capacities, does HDFS still spread the data blocks equally among all the datanodes, or will the datanodes with higher disk capacity end up storing more data blocks? Similarly, if the tasktrackers have different numbers of CPUs, is there a way to configure hadoop to run more tasks on those tasktrackers that have more CPUs? Is that simply a matter of setting mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum differently on the tasktrackers?

Bill
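If per-node slot counts are the way to go, the settings would live in each tasktracker's conf/hadoop-site.xml, so every node can carry its own values; the counts below are illustrative for an 8-core box:

    <!-- hadoop-site.xml on a larger node; slot counts are examples only -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>

Tasktrackers read these at startup, so heterogeneous values take effect after restarting the tasktracker on each node.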
Re: using distcp for http source files
Tsz Wo (Nicholas), Sze writes:
> Hi Derek,
>
> The "http" in "http://core:7274/logs/log.20090121" should be "hftp". hftp is the scheme name of HftpFileSystem, which uses http for accessing hdfs.
>
> Hope this helps.
>
> Nicholas Sze

I thought hftp was used to talk to servlets that act as a gateway to hdfs, right? In my case these will be servers that are serving up static log files, running no servlets. I believe this is the scenario that HADOOP-341 describes: "Enhance it [distcp] to handle http as the source protocol i.e. support copying files from arbitrary http-based sources into the dfs."

In any case, if I just use hftp instead of http I get this error:

bin/hadoop distcp -f hftp://core:7274/logs/log.20090121 /user/dyoung/mylogs

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Server returned HTTP response code: 400 for URL: http://core:7274/data/logs/log.20090121?ugi=dyoung,dyoung,adm,dialout,fax,cdrom,cdrom,floppy,floppy,tape,audio,audio,dip,dip,video,video,plugdev,plugdev,admin,users,scanner,fuse,fuse,lpadmin,admin,vboxusers
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
        at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
        at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:581)
        at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
        at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)

> > - Original Message
> > From: Derek Young
> > To: core-u...@...
> > Sent: Wednesday, January 21, 2009 1:23:56 PM
> > Subject: using distcp for http source files
> >
> > I plan to use hadoop to do some log processing and I'm working on a method to load the files (probably nightly) into hdfs. My plan is to have a web server on each machine with logs that serves up the log directories. Then I would give distcp a list of http URLs of the log files and have it copy the files in.
> >
> > Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this should be supported, but the http URLs are not working for me. Are http source URLs still supported?
> >
> > I tried a simple test with an http source URL (using Hadoop 0.19):
> >
> > hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs
> >
> > This fails:
> >
> > With failures, global counters are inaccurate; consider running with -i
> > Copy failed: java.io.IOException: No FileSystem for scheme: http
> >        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
> >        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
> >        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
> >        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
> >        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> >        at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
> >        at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
> >        at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
> >        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
RE: Problem running hdfs_test
Hello,

As I mentioned in my previous email, I am having a segmentation fault at 0x0001 while running hdfs_test. I was advised to build and run hdfs_test using ant, since ant sets some environment variables which the Makefile won't. I tried building libhdfs and running hdfs_test using ant, but I am still having the same problem.

Now, instead of hdfs_test, I am trying a simpler test with libhdfs. I linked the following hello world program with libhdfs:

    #include <stdio.h>
    #include "hdfs.h"

    int main() {
        printf("Hello World.\n");
        return 0;
    }

I added a line to compile this test program in ${HADOOP_HOME}/src/c++/libhdfs/Makefile and replaced hdfs_test with this test program in ${HADOOP_HOME}/src/c++/libhdfs/tests/test-libhdfs.sh. I built and invoked this test using the test-libhdfs target in build.xml, but I am still getting a segmentation fault when this simple test program is invoked from test-libhdfs.sh. I followed these steps:

    cd ${HADOOP_HOME}
    ant clean
    cd ${HADOOP_HOME}/src/c++/libhdfs/
    rm -f hdfs_test hdfs_write hdfs_read libhdfs.so* *.o test
    cd ${HADOOP_HOME}
    ant test-libhdfs -Dlibhdfs=1

Error line:

    [exec] ./tests/test-libhdfs.sh: line 85: 23019 Segmentation fault $LIBHDFS_BUILD_DIR/$HDFS_TEST

I have attached the output of this command with this email. I have added "env" to test-libhdfs.sh to see which environment variables are set. Please suggest if any variable is set wrongly. Any kind of suggestion will be helpful, as I have already spent a lot of time on this problem. I have added the following lines to the Makefile and test-libhdfs.sh:

Makefile:

    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-icedtea-1.7.0.0.x86_64
    export OS_ARCH=amd64
    export OS_NAME=Linux
    export LIBHDFS_BUILD_DIR=$(HADOOP_HOME)/src/c++/libhdfs
    export SHLIB_VERSION=1

test-libhdfs.sh:

    HADOOP_CONF_DIR=${HADOOP_HOME}/conf
    HADOOP_LOG_DIR=${HADOOP_HOME}/logs
    LIBHDFS_BUILD_DIR=${HADOOP_HOME}/src/c++/libhdfs
    HDFS_TEST=test

When I don't link libhdfs with test.c, it doesn't give an error and prints "Hello World" when "ant test-libhdfs -Dlibhdfs=1" is run. I made sure that "ant" and "hadoop" use the same Java installation. I have also tried this on a 32-bit machine, but I am still having the segmentation fault. Now I am clueless about what I can do to correct this. Please help.

Thanks,
Arifa.

PS: Also, please suggest: is there any Java version of hdfs_test?

-Original Message-
From: Delip Rao [mailto:delip...@gmail.com]
Sent: Saturday, January 17, 2009 3:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Problem running hdfs_test

Try enabling the debug flags while compiling to get more information.

On Sat, Jan 17, 2009 at 4:19 AM, Arifa Nisar wrote:
> Hello all,
>
> I am trying to test hdfs_test.c provided with the hadoop installation. libhdfs.so and hdfs_test are built fine after making a few changes in $(HADOOP_HOME)/src/c++/libhdfs/Makefile. But when I try to run ./hdfs_test, I get a segmentation fault at 0x0001:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0001 in ?? ()
> (gdb) bt
> #0 0x0001 in ?? ()
> #1 0x7fffd0c51af5 in ?? ()
> #2 0x in ?? ()
>
> A simple hello world program linked with libhdfs.so also gives the same error. In CLASSPATH, all the jar files in $(HADOOP_HOME), $(HADOOP_HOME)/conf, $(HADOOP_HOME)/lib, $(JAVA_HOME)/lib are included. Please help.
>
> Thanks,
> Arifa.
Buildfile: build.xml

init:
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/classes
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/tools
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/src
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/webapps/task/WEB-INF
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/webapps/job/WEB-INF
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/webapps/hdfs/WEB-INF
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/webapps/datanode/WEB-INF
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/webapps/secondary/WEB-INF
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/examples
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/ant
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/c++
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test/classes
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test/testjar
    [mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test/testshell
    [touch] Creating /tmp/null333580654
    [delete] Deleting: /tmp/null333580654
    [copy] Copying 7 files to /home/ani662/hadoop/hadoop-0.19.0/build/webapps
    [exec] svn: '.' is not a working copy
    [exec] svn: '.' is not a working copy

compile-libhdfs:
    [mkdir]
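A segfault this early often means the test binary was linked or run without the JVM library on the path. A typical manual compile-and-run of a libhdfs program looks roughly like this; the libjvm path is an assumption for a 64-bit Linux JDK layout:

    # compile and link against libhdfs and libjvm (paths illustrative)
    gcc test.c -I${HADOOP_HOME}/src/c++/libhdfs \
        -L${LIBHDFS_BUILD_DIR} -lhdfs \
        -L${JAVA_HOME}/jre/lib/amd64/server -ljvm -o test

    # libhdfs needs the JVM and the Hadoop jars at run time
    export LD_LIBRARY_PATH=${LIBHDFS_BUILD_DIR}:${JAVA_HOME}/jre/lib/amd64/server
    CLASSPATH=${HADOOP_HOME}/conf
    for jar in ${HADOOP_HOME}/*.jar ${HADOOP_HOME}/lib/*.jar; do
      CLASSPATH=${CLASSPATH}:${jar}
    done
    export CLASSPATH

    ./test

Note the explicit jar-by-jar CLASSPATH: the JVM that libhdfs embeds does not expand wildcards, so a classpath of "lib/*" silently loads nothing.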
Re: using distcp for http source files
Hi Derek, The "http" in "http://core:7274/logs/log.20090121"; should be "hftp". hftp is the scheme name of HftpFileSystem which uses http for accessing hdfs. Hope this helps. Nicholas Sze - Original Message > From: Derek Young > To: core-user@hadoop.apache.org > Sent: Wednesday, January 21, 2009 1:23:56 PM > Subject: using distcp for http source files > > I plan to use hadoop to do some log processing and I'm working on a method to > load the files (probably nightly) into hdfs. My plan is to have a web server > on > each machine with logs that serves up the log directories. Then I would give > distcp a list of http URLs of the log files and have it copy the files in. > > Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this > should be supported, but the http URLs are not working for me. Are http > source > URLs still supported? > > I tried a simple test with an http source URL (using Hadoop 0.19): > > hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs > > This fails: > > With failures, global counters are inaccurate; consider running with -i > Copy failed: java.io.IOException: No FileSystem for scheme: http >at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364) >at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56) >at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379) >at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215) >at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) >at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578) >at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74) >at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775) >at org.apache.hadoop.tools.DistCp.run(DistCp.java:844) >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) >at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
using distcp for http source files
I plan to use hadoop to do some log processing and I'm working on a method to load the files (probably nightly) into hdfs. My plan is to have a web server on each machine with logs that serves up the log directories. Then I would give distcp a list of http URLs of the log files and have it copy the files in.

Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this should be supported, but the http URLs are not working for me. Are http source URLs still supported?

I tried a simple test with an http source URL (using Hadoop 0.19):

hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs

This fails:

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
        at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
        at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
        at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
ANN: hbase-0.19.0 release available for download
HBase 0.19.0 is now available for download:

http://hadoop.apache.org/hbase/releases.html

Thanks to all who contributed to this release. 185 issues have been fixed since hbase 0.18.0. Release notes are available here: http://tinyurl.com/8xmyx9

At your service,
The HBase Team
RE: Suitable for Hadoop?
Darren- I would definitely use HDFS to get the data to all the instances. I'm not sure about your 32 processes or SQS, but let me/us know what you find.

Richard J. Zak

-Original Message-
From: Ricky Ho [mailto:r...@adobe.com]
Sent: Wednesday, January 21, 2009 15:00
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?

Jim, thanks for your explanation. But isn't isSplittable an option for writing output rather than reading input? There are two phases.

1) Upload the data from local files to HDFS. Is there an option in "hadoop fs copy" to pack multiple small files into a single block and also not split the files? Should "isSplittable" be applied here?

2) The Mapper will start reading data using MultiFileInputFormat, which will read data in units of block size. Therefore a single mapper will get all the files within its assigned block. Right?

Rgds,
Ricky

-Original Message-
From: Jim Twensky [mailto:jim.twen...@gmail.com]
Sent: Wednesday, January 21, 2009 11:47 AM
To: core-user@hadoop.apache.org
Subject: Re: Suitable for Hadoop?

Ricky,

Hadoop is primarily optimized for large files, usually files larger than one input split. However, there is an input format called MultiFileInputFormat which can be used to make Hadoop work efficiently on smaller files. You can also set the isSplittable method of an input format to "false" and ensure that a file is not split into pieces but rather processed by only one mapper.

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho wrote:
> Hmmm ...
>
> From a space efficiency perspective, given HDFS (with a large block size) is expecting large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least 1 block, or can multiple files sit in the same block?
>
> Rgds,
> Ricky
>
> -Original Message-
> From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary)
> - Have the map process all contents for that one given directory from the input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful!
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
> I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?
>
> thanks for any advice
> Darren
RE: Suitable for Hadoop?
Jim, thanks for your explanation. But isn't isSplittable an option for writing output rather than reading input? There are two phases.

1) Upload the data from local files to HDFS. Is there an option in "hadoop fs copy" to pack multiple small files into a single block and also not split the files? Should "isSplittable" be applied here?

2) The Mapper will start reading data using MultiFileInputFormat, which will read data in units of block size. Therefore a single mapper will get all the files within its assigned block. Right?

Rgds,
Ricky

-Original Message-
From: Jim Twensky [mailto:jim.twen...@gmail.com]
Sent: Wednesday, January 21, 2009 11:47 AM
To: core-user@hadoop.apache.org
Subject: Re: Suitable for Hadoop?

Ricky,

Hadoop is primarily optimized for large files, usually files larger than one input split. However, there is an input format called MultiFileInputFormat which can be used to make Hadoop work efficiently on smaller files. You can also set the isSplittable method of an input format to "false" and ensure that a file is not split into pieces but rather processed by only one mapper.

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho wrote:
> Hmmm ...
>
> From a space efficiency perspective, given HDFS (with a large block size) is expecting large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least 1 block, or can multiple files sit in the same block?
>
> Rgds,
> Ricky
>
> -Original Message-
> From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary)
> - Have the map process all contents for that one given directory from the input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful!
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
> I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?
>
> thanks for any advice
> Darren
Re: Suitable for Hadoop?
Ricky,

Hadoop is primarily optimized for large files, usually files larger than one input split. However, there is an input format called MultiFileInputFormat which can be used to make Hadoop work efficiently on smaller files. You can also set the isSplittable method of an input format to "false" and ensure that a file is not split into pieces but rather processed by only one mapper.

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho wrote:
> Hmmm ...
>
> From a space efficiency perspective, given HDFS (with a large block size) is expecting large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least 1 block, or can multiple files sit in the same block?
>
> Rgds,
> Ricky
>
> -Original Message-
> From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary)
> - Have the map process all contents for that one given directory from the input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful!
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
> I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?
>
> thanks for any advice
> Darren
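To make the input side concrete, disabling splits on the old mapred API looks roughly like the sketch below. Note the framework actually spells the method isSplitable, and WholeFileTextInputFormat is a made-up name:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical input format: never split, so each input file is
    // consumed in its entirety by a single map task.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }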
RE: Suitable for Hadoop?
Richard,

Thanks for the suggestion. I actually am building an EC2 architecture to facilitate this! I tried using a database to warehouse the files, and then NFS, but the connection load is too heavy. So I thought maybe HDFS could be used just to mitigate the data access across all the instances.

I have a parallel processing architecture based on SQS queues, but will consider a map process. I have about 32 processes or so per machine in EC2 reading from SQS queues for files to process; they could then efficiently get the files from HDFS, yes? Without bottlenecking access to a database or NFS server? Well, I will test this direction and see.

Thank you!
Darren

On Wed, 2009-01-21 at 09:41 -0500, Zak, Richard [USA] wrote:
> You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary)
> - Have the map process all contents for that one given directory from the input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful!
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
> I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?
>
> thanks for any advice
> Darren
RE: Suitable for Hadoop?
Hmmm ...

From a space efficiency perspective, given HDFS (with a large block size) is expecting large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least 1 block, or can multiple files sit in the same block?

Rgds,
Ricky

-Original Message-
From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
Sent: Wednesday, January 21, 2009 6:42 AM
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?

You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs.

What I would do is this:
- Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs)
- Don't use any Reducers
- Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary)
- Have the map process all contents for that one given directory from the input text file
- Break down the documents into more directories to go easier on the memory
- Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed)

I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful!

Richard J. Zak

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com]
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?

Hi,
I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?

thanks for any advice
Darren
RE: Suitable for Hadoop?
You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs.

What I would do is this:
- Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs)
- Don't use any Reducers
- Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary)
- Have the map process all contents for that one given directory from the input text file
- Break down the documents into more directories to go easier on the memory
- Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances; modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed; see the sketch after this message)

I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful!

Richard J. Zak

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com]
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?

Hi,
I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?

thanks for any advice
Darren
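The heap setting in that last bullet normally lives in conf/hadoop-env.sh (or is passed through the EC2 launch script as described); the value below is illustrative, in MB:

    # conf/hadoop-env.sh - maximum heap for Hadoop daemons and tasks
    export HADOOP_HEAPSIZE=2000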
Re: Why does Hadoop need ssh access to master and slaves?
Hi Matthias,

It is not necessary to have SSH set up to run Hadoop, but it does make things easier. SSH is used by the scripts in the bin directory which start and stop daemons across the cluster (the slave nodes are defined in the slaves file); see the start-all.sh script as a starting point. These scripts are a convenient way to control Hadoop, but there are other possibilities. If you had another system to control daemons on your cluster then you wouldn't need SSH.

Tom

On Wed, Jan 21, 2009 at 1:20 PM, Matthias Scherer wrote:
> Hi Steve and Amit,
>
> Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly - that is, which grid administration tasks - hadoop does via ssh. Does it restart crashed datanodes or tasktrackers on the slaves? Or does it transfer data over the grid with ssh access? How can I find a short description of what exactly hadoop needs ssh for? The documentation says only that I have to configure it.
>
> Thanks & Regards
> Matthias
>
>> -Original Message-
>> From: Steve Loughran [mailto:ste...@apache.org]
>> Sent: Wednesday, January 21, 2009 13:59
>> To: core-user@hadoop.apache.org
>> Subject: Re: Why does Hadoop need ssh access to master and slaves?
>>
>> Amit k. Saha wrote:
>> > On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer wrote:
>> >> Hi all,
>> >>
>> >> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a hadoop grid was very easy and works fine.
>> >>
>> >> Now our operations team wonders why hadoop has to be able to connect to the master and slaves via password-less ssh?! Can anyone give us an answer to this question?
>> >
>> > 1. There has to be a way to connect to the remote hosts - slaves and a secondary master - and SSH is the secure way to do it. 2. It has to be password-less to enable automatic logins.
>>
>> SSH is *a* secure way to do it, but not the only way. Other management tools can bring up hadoop clusters. Hadoop ships with scripted support for SSH as it is standard with Linux distros and generally the best way to bring up a remote console.
>>
>> Matthias,
>> Your ops team should not be worrying about the SSH security, as long as they keep their keys under control.
>>
>> (a) Key-based SSH is more secure than passworded SSH, as man-in-the-middle attacks are prevented. Passphrase-protected SSH keys on external USB keys are even better.
>>
>> (b) once the cluster is up, that filesystem is pretty vulnerable to anything on the LAN. You do need to lock down your datacentre, or set up the firewall/routing of the servers so that only trusted hosts can talk to the FS. SSH becomes a detail at that point.
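A minimal sketch of what those control scripts do, simplified from bin/slaves.sh (the real script also handles options and environment setup):

    # run a command on every host listed, one per line, in conf/slaves,
    # over password-less ssh
    for slave in $(cat "${HADOOP_CONF_DIR}/slaves"); do
      ssh "$slave" "cd ${HADOOP_HOME} && bin/hadoop-daemon.sh start datanode" &
    done
    wait

So SSH is only the control channel for starting and stopping daemons; data transfer between nodes happens over Hadoop's own RPC and HTTP ports.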
Re: Why does Hadoop need ssh access to master and slaves?
Matthias Scherer wrote:
> Hi Steve and Amit,
>
> Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly - that is, which grid administration tasks - hadoop does via ssh. Does it restart crashed datanodes or tasktrackers on the slaves? Or does it transfer data over the grid with ssh access? How can I find a short description of what exactly hadoop needs ssh for? The documentation says only that I have to configure it.
>
> Thanks & Regards
> Matthias

SSH is used by the various scripts in bin/ to start and stop clusters; slaves.sh does the work, and the other ones (like hadoop-daemons.sh) use it to run stuff on the machines.

The EC2 scripts use SSH to talk to the machines brought up there; when you ask Amazon for machines, you give it a public key to be set in the allowed keys list of root; you use that to ssh in and run code.

There is currently no liveness/restarting built into the scripts; you need other things to do that. I am working on this, with HADOOP-3628: https://issues.apache.org/jira/browse/HADOOP-3628

I will be showing some other management options at ApacheCon EU 2009, which, being on the same continent and timezone, is something you may want to consider attending; lots of Hadoop people will be there, with some all-day sessions on it. http://eu.apachecon.com/c/aceu2009/sessions/227

One big problem with cluster management is not just recognising failed nodes, it's handling them. The actions you take are different with a VM cluster like EC2 (fix: reboot, then kill that AMI and create a new one), from that of a VMware/Xen-managed cluster, to that of physical systems (Y!: phone Allen; us: email Paolo). Once we have the health monitoring in there, different people will need to apply their own policies.

-steve

--
Steve Loughran http://www.1060.org/blogxter/publish/5
RE: Why does Hadoop need ssh access to master and slaves?
Hi Steve and Amit,

Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly - that is, which grid administration tasks - hadoop does via ssh. Does it restart crashed datanodes or tasktrackers on the slaves? Or does it transfer data over the grid with ssh access? How can I find a short description of what exactly hadoop needs ssh for? The documentation says only that I have to configure it.

Thanks & Regards
Matthias

> -Original Message-
> From: Steve Loughran [mailto:ste...@apache.org]
> Sent: Wednesday, January 21, 2009 13:59
> To: core-user@hadoop.apache.org
> Subject: Re: Why does Hadoop need ssh access to master and slaves?
>
> Amit k. Saha wrote:
> > On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer wrote:
> >> Hi all,
> >>
> >> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a hadoop grid was very easy and works fine.
> >>
> >> Now our operations team wonders why hadoop has to be able to connect to the master and slaves via password-less ssh?! Can anyone give us an answer to this question?
> >
> > 1. There has to be a way to connect to the remote hosts - slaves and a secondary master - and SSH is the secure way to do it. 2. It has to be password-less to enable automatic logins.
>
> SSH is *a* secure way to do it, but not the only way. Other management tools can bring up hadoop clusters. Hadoop ships with scripted support for SSH as it is standard with Linux distros and generally the best way to bring up a remote console.
>
> Matthias,
> Your ops team should not be worrying about the SSH security, as long as they keep their keys under control.
>
> (a) Key-based SSH is more secure than passworded SSH, as man-in-the-middle attacks are prevented. Passphrase-protected SSH keys on external USB keys are even better.
>
> (b) once the cluster is up, that filesystem is pretty vulnerable to anything on the LAN. You do need to lock down your datacentre, or set up the firewall/routing of the servers so that only trusted hosts can talk to the FS. SSH becomes a detail at that point.
Suitable for Hadoop?
Hi,

I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it?

thanks for any advice
Darren
Re: Why does Hadoop need ssh access to master and slaves?
Amit k. Saha wrote:
> On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer wrote:
>> Hi all,
>>
>> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a hadoop grid was very easy and works fine.
>>
>> Now our operations team wonders why hadoop has to be able to connect to the master and slaves via password-less ssh?! Can anyone give us an answer to this question?
>
> 1. There has to be a way to connect to the remote hosts - slaves and a secondary master - and SSH is the secure way to do it. 2. It has to be password-less to enable automatic logins.

SSH is *a* secure way to do it, but not the only way. Other management tools can bring up hadoop clusters. Hadoop ships with scripted support for SSH as it is standard with Linux distros and generally the best way to bring up a remote console.

Matthias,
Your ops team should not be worrying about the SSH security, as long as they keep their keys under control.

(a) Key-based SSH is more secure than passworded SSH, as man-in-the-middle attacks are prevented. Passphrase-protected SSH keys on external USB keys are even better.

(b) once the cluster is up, that filesystem is pretty vulnerable to anything on the LAN. You do need to lock down your datacentre, or set up the firewall/routing of the servers so that only trusted hosts can talk to the FS. SSH becomes a detail at that point.
Re: Why does Hadoop need ssh access to master and slaves?
On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer wrote:
> Hi all,
>
> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a hadoop grid was very easy and works fine.
>
> Now our operations team wonders why hadoop has to be able to connect to the master and slaves via password-less ssh?! Can anyone give us an answer to this question?

1. There has to be a way to connect to the remote hosts - slaves and a secondary master - and SSH is the secure way to do it.
2. It has to be password-less to enable automatic logins.

-Amit

> Thanks & Regards
> Matthias

--
Amit Kumar Saha
http://amitksaha.blogspot.com
http://amitsaha.in.googlepages.com/
*Bangalore Open Java Users Group*: http://www.bojug.in
Why does Hadoop need ssh access to master and slaves?
Hi all,

we've made our first steps in evaluating hadoop. The setup of 2 VMs as a hadoop grid was very easy and works fine.

Now our operations team wonders why hadoop has to be able to connect to the master and slaves via password-less ssh?! Can anyone give us an answer to this question?

Thanks & Regards
Matthias
Re: Null Pointer with Pattern file
Hi,

Try to use:

    conf.setJarByClass(EchoOche.class); // conf is the JobConf instance of your example

Hope this helps,
Rasit

2009/1/20 Shyam Sarkar
> Hi,
>
> I was trying to run the Hadoop wordcount version 2 example under Cygwin. I tried without the pattern.txt file -- it works fine. I tried with the pattern.txt file to skip some patterns, and I get a NULL POINTER exception as follows:
>
> 09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 09/01/20 12:56:17 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4
> 09/01/20 12:56:17 INFO mapred.JobClient: Running job: job_local_0001
> 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4
> 09/01/20 12:56:17 INFO mapred.MapTask: numReduceTasks: 1
> 09/01/20 12:56:17 INFO mapred.MapTask: io.sort.mb = 100
> 09/01/20 12:56:17 INFO mapred.MapTask: data buffer = 79691776/99614720
> 09/01/20 12:56:17 INFO mapred.MapTask: record buffer = 262144/327680
> 09/01/20 12:56:17 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.NullPointerException
>   at org.myorg.WordCount$Map.configure(WordCount.java:39)
>   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>   at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>   at org.myorg.WordCount.run(WordCount.java:114)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.myorg.WordCount.main(WordCount.java:119)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>   at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>   at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> Please tell me what I should do.
>
> Thanks,
> shyam.s.sar...@gmail.com

--
M. Raşit ÖZDAŞ
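For context, the jar warning in the log above points at the fix; either form below tells Hadoop which jar to ship to the tasks (WordCount stands in for the job class):

    // in run(): infer the job jar from a class it contains
    JobConf conf = new JobConf(getConf(), WordCount.class);
    // or, on an existing JobConf:
    conf.setJarByClass(WordCount.class);

Without the job jar set, user classes may not be found when tasks are configured, which is one way an NPE like the one above can arise in configure().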
Re: streaming split sizes
Hi Dmitry,

What version of hadoop are you using?

Assuming your 3G DB is a read-only lookup... can you load it into memory in the Map.configure and then use (0.19+ only):

    mapred.job.reuse.jvm.num.tasks = -1

so that the Maps are reused for all time and the in-memory DB is not reloaded?

This is what I did for cross-referencing Point data (the input) with an in-memory index of Polygons (the lookup DB) to find which polygons the points fell in (http://biodivertido.blogspot.com/2008/11/reproducing-spatial-joins-using-hadoop.html).

Cheers,
Tim

On Wed, Jan 21, 2009 at 4:07 AM, Dmitry Pushkarev wrote:
> Well, the database is specifically designed to fit into memory, and if it does not it will slow things down hundreds of times. One simple hack I came to is to replace map tasks by /bin/cat and then run 150 reducers that will have the database constantly in memory. Parallelism is also not a problem, since we're running a very small cluster (15 nodes, 120 cores) specifically built for the task.
>
> ---
> Dmitry Pushkarev
> +1-650-644-8988
>
> -Original Message-
> From: Delip Rao [mailto:delip...@gmail.com]
> Sent: Tuesday, January 20, 2009 6:19 PM
> To: core-user@hadoop.apache.org
> Subject: Re: streaming split sizes
>
> Hi Dmitry,
>
> Not a direct answer to your question, but I think the right approach would be to not load your database into memory during config() but instead look up the database from map() via HBase or something similar. That way you don't have to worry about the split sizes. In fact, using fewer splits would limit the parallelism you can achieve, given that your maps are so fast.
>
> - delip
>
> On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev wrote:
>> Hi.
>>
>> I'm running streaming on a relatively big (2Tb) dataset, which is being split by hadoop into 64mb pieces. One of the problems I have with that is that my map tasks take a very long time to initialize (they need to load a 3GB database into RAM) and they finish these 64mb in 10 seconds.
>>
>> So I'm wondering if there is any way to make hadoop give larger datasets to map jobs? (The trivial way, of course, would be to split the dataset into N files and make it feed one file at a time, but is there any standard solution for that?)
>>
>> Thanks.
>>
>> ---
>> Dmitry Pushkarev
>> +1-650-644-8988
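For a streaming job, the JVM-reuse setting Tim mentions is passed on the command line; a rough sketch (the jar path and mapper/reducer names are placeholders):

    bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
        -input /data/in -output /data/out \
        -mapper my_mapper.py -reducer my_reducer.py \
        -jobconf mapred.job.reuse.jvm.num.tasks=-1

With -1, each task JVM is reused for every subsequent task of the job on that tasktracker, so anything loaded in configure() survives across the small 64mb splits instead of being reloaded per task.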