Intermittent "Already Being Created Exception"
Hi all,

I have a 50-node cluster and I am trying to write logs of about 1GB each into HDFS. I need to write them in a temporal fashion: for every 15 minutes' worth of data, I close the previously opened file and create a new one. The snippet of code is:

if() {
    if (out != null) {
        out.close();
    }
    String outFileStr = outputDir + File.separator + outDate + File.separator + outFileSuffix + "." + outputMinute;
    System.out.println("Creating outFileStr:" + outFileStr);
    Path outFile = new Path(outFileStr);
    out = fs.create(outFile); // It throws the exception here, saying "Already Being Created"
}

When I run this code, I get intermittent "Already Being Created" exceptions. I am not using any threads and this code runs sequentially. I went through the previous mailing list posts but couldn't find much information. Can anyone please tell me why this is happening and how to avoid it?

Thanks
Pallavi
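For reference, a minimal self-contained sketch of the rolling-file pattern the snippet above implements (the class name, output path, and window handling are illustrative assumptions, not taken from the post):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    private final FileSystem fs;
    private FSDataOutputStream out;   // stream for the current 15-minute window
    private String currentWindow;     // e.g. "2009-05-28/log.0915" (illustrative)

    public RollingHdfsWriter(Configuration conf) throws IOException {
        fs = FileSystem.get(conf);
    }

    // Called for every record; rolls to a new file when the window changes.
    public void write(String window, String record) throws IOException {
        if (!window.equals(currentWindow)) {
            if (out != null) {
                out.close();          // finish the previous file before creating the next one
            }
            Path outFile = new Path("/logs/" + window);   // illustrative output path
            out = fs.create(outFile);
            currentWindow = window;
        }
        out.writeBytes(record);
    }

    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
        fs.close();
    }
}

The point of the sketch is simply that the previous stream is fully closed before fs.create() is called for the next window's file.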
Can I have Reducer with No Output?
Hi,

I have maps that do most of the work, and they output the data to a reducer; the key is a constant, so basically the reducer combines all the input from the maps into a file and does a "LOAD DATA" of that file into a MySQL db. So there won't be any output.collect() in the reduce function. But when I write the Reducer class, the compiler keeps complaining about the two missing type parameters at the end of "private static class Reducer extends MapReduceBase implements Reducer", since it doesn't have output.collect(). What should I put in there?

Do I even need a Reducer at all? I think having only one reducer save the file will be more efficient than having every map save data into the same file individually. Is there a better way to do this?

Thanks.
--
View this message in context: http://www.nabble.com/Can-I-have-Reducer-with-No-Output--tp23756950p23756950.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Can I have Reducer with No Output?
Yes you can do this. It is complaining because you are not declaring the output types in the method signature, but you will not use them anyway. So please try private static class Reducer extends MapReduceBase implements Reducer { ... The output format will be a TextOutputFormat, but it will not do anything. Cheers Tim On Thu, May 28, 2009 at 9:57 AM, dealmaker wrote: > > Hi, > I have maps that do most of the work, and they output the data into a > reducer, so basically key is a constant, and the reducer combines all the > input from maps into a file and it does "LOAD_DATA" the file into mysql db. > So, there won't be any output.collect ( ) in reducer function. But when I > write the class Reducer, the compiler keeps complaining about the missing 2 > types at the end of in "private static class Reducer extends MapReduceBase > implements Reducer " which doesn't have output.collect ( ). > What should I put in there? > > Do I even need Reducer at all? I think that only one reducer saves the file > will be more efficient, rather than having all the map to save data into > same file individually. Any better way to do this? > > Thanks. > -- > View this message in context: > http://www.nabble.com/Can-I-have-Reducer-with-No-Output--tp23756950p23756950.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > >
Re: Can I have Reducer with No Output?
If your reducer does not write anything, you could look at NullOutputFormat as well. Jothi On 5/28/09 1:38 PM, "tim robertson" wrote: > Yes you can do this. > > It is complaining because you are not declaring the output types in > the method signature, but you will not use them anyway. > > So please try > > private static class Reducer extends MapReduceBase implements > Reducer { > ... > > The output format will be a TextOutputFormat, but it will not do anything. > > Cheers > > Tim > > > > On Thu, May 28, 2009 at 9:57 AM, dealmaker wrote: >> >> Hi, >> I have maps that do most of the work, and they output the data into a >> reducer, so basically key is a constant, and the reducer combines all the >> input from maps into a file and it does "LOAD_DATA" the file into mysql db. >> So, there won't be any output.collect ( ) in reducer function. But when I >> write the class Reducer, the compiler keeps complaining about the missing 2 >> types at the end of in "private static class Reducer extends MapReduceBase >> implements Reducer " which doesn't have output.collect ( ). >> What should I put in there? >> >> Do I even need Reducer at all? I think that only one reducer saves the file >> will be more efficient, rather than having all the map to save data into >> same file individually. Any better way to do this? >> >> Thanks. >> -- >> View this message in context: >> http://www.nabble.com/Can-I-have-Reducer-with-No-Output--tp23756950p23756950. >> html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >>
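Putting the two suggestions together, a minimal sketch of a reducer that never calls output.collect(), with the job wired to NullOutputFormat (the Text input types and the spot where the MySQL LOAD DATA would happen are assumptions based on the question, not code from this thread):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class NoOutputReduceExample {

    // Output types are declared (so the compiler is happy) even though collect() is never called.
    public static class LoadReducer extends MapReduceBase
            implements Reducer<Text, Text, NullWritable, NullWritable> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<NullWritable, NullWritable> output,
                Reporter reporter) throws IOException {
            while (values.hasNext()) {
                Text value = values.next();
                // write value to a local file / LOAD DATA into MySQL here (assumption)
            }
            // no output.collect(): nothing is emitted
        }
    }

    public static void configure(JobConf conf) {
        conf.setReducerClass(LoadReducer.class);
        conf.setOutputFormat(NullOutputFormat.class);  // discard reduce output entirely
    }
}

With NullOutputFormat the declared output types never reach disk, so NullWritable (or any placeholder type) is fine.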
Re: hadoop hardware configuration
Patrick Angeles wrote:

Sorry for cross-posting, I realized I sent the following to the hbase list when it's really more a Hadoop question.

This is an interesting question. Obviously as an HP employee you must assume that I'm biased when I say HP DL160 servers are good value for the workers, though our blade systems are very good for a high physical density - provided you have the power to fill up the rack.

2 x Hadoop Master (and Secondary NameNode)
- 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
- 16GB DDR2-800 Registered ECC Memory
- 4 x 1TB 7200rpm SATA II Drives
- Hardware RAID controller
- Redundant Power Supply
- Approx. 390W power draw (1.9amps 208V)
- Approx. $4000 per unit

I do not know what the advantages of that many cores are on a NN. Someone needs to do some experiments. I do know you need enough RAM to hold the index in memory, and you may want to go for a bigger block size to keep the index size down.

6 x Hadoop Task Nodes
- 1 x 2.3Ghz Quad Core (Opteron 1356)
- 8GB DDR2-800 Registered ECC Memory
- 4 x 1TB 7200rpm SATA II Drives
- No RAID (JBOD)
- Non-Redundant Power Supply
- Approx. 210W power draw (1.0amps 208V)
- Approx. $2000 per unit

I had some specific questions regarding this configuration...

1. Is hardware RAID necessary for the master node?

You need a good story to ensure that loss of a disk on the master doesn't lose the filesystem. I like RAID there, but the alternative is to push the stuff out over the network to other storage you trust. That could be NFS-mounted RAID storage, it could be NFS-mounted JBOD. Whatever your chosen design, test that it works before you go live by running the cluster and then simulating different failures, and see how well the hardware/ops team handles it. Keep an eye on where that data goes, because when the NN runs out of file storage, the consequences can be pretty dramatic (i.e. the cluster doesn't come up unless you edit the edit log by hand).

2. What is a good processor-to-storage ratio for a task node with 4TB of raw storage? (The config above has 1 core per 1TB of raw storage.)

That really depends on the work you are doing... the bytes in/out to CPU work, and the size of any memory structures that are built up over the run.

With 1 core per physical disk, you get the bandwidth of a single disk per CPU; for some IO-intensive work you can make the case for two disks/CPU - one in, one out - but then you are using more power, and if/when you want to add more storage, you have to pull out the disks to stick in new ones. If you go for more CPUs, you will probably need more RAM to go with it.

3. Am I better off using dual quads for a task node, with a higher power draw? A dual quad task node with 16GB RAM and 4TB storage costs roughly $3200, but draws almost 2x as much power. The tradeoffs are:
1. I will get more CPU per dollar and per watt.
2. I will only be able to fit 1/2 as many dual quad machines into a rack.
3. I will get 1/2 the storage capacity per watt.
4. I will get less I/O throughput overall (fewer spindles per core).

First there is the algorithm itself, and whether you are IO or CPU bound. Most MR jobs that I've encountered are fairly IO bound - without indexes, every lookup has to stream through all the data, so it's power inefficient and IO limited - but if you are trying to do higher-level stuff than just lookups, then you will be doing more CPU work.

Then there is the question of where your electricity comes from, what the limits for the room are, whether you are billed on power drawn or quoted PSU draw, what the HVAC limits are, what the maximum allowed weight per rack is, etc., etc.

I'm a fan of low-Joule work, though we don't have any benchmarks yet of the power efficiency of different clusters; the number of MJ used to do a terasort. I'm debating doing some single-CPU tests for this on my laptop, as the battery knows how much gets used up by some work.

4. In planning storage capacity, how much spare disk space should I take into account for 'scratch'? For now, I'm assuming 1x the input data size.

That you should probably be able to determine from experimental work on smaller datasets. Some maps can throw out a lot of data; most reduces do actually reduce the final amount.

-Steve

(Disclaimer: I'm not making any official recommendations for hardware here, just making my opinions known. If you do want an official recommendation from HP, talk to your reseller or account manager; someone will look at your problem in more detail and make some suggestions. If you have any code/data that could be shared for benchmarking, that would help validate those suggestions.)
Issue with usage of fs -test
Hello,

I am facing a strange issue, wherein /fs -test -e/ fails and /fs -ls/ succeeds in listing the file. Following is a grep of such a result:

bin]$ hadoop fs -ls /projects/myproject///.done
Found 1 items
-rw--- 3 user hdfs 0 2009-03-19 22:28 /projects/myproject///.done
[...@mymachine bin]$ echo $?
0
[...@mymachine bin]$ hadoop fs -test -e /projects/myproject///.done
[...@mymachine bin]$ echo $?
1

What is the cause of such behaviour? Any pointers would be much appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly as env vars.)

Thanks
Pankaj
RE: Issue with usage of fs -test
Maybe https://issues.apache.org/jira/browse/HADOOP-3792 ? Koji -Original Message- From: pankaj jairath [mailto:pjair...@yahoo-inc.com] Sent: Thursday, May 28, 2009 4:49 AM To: core-user@hadoop.apache.org Subject: Issue with usage of fs -test Hello, I am facing a strange issue, where in the /fs -test -e/ fails and /fs -ls/ succeeds to list the file. Following is the grep of such a result : bin]$ hadoop fs -ls /projects/myproject///.done Found 1 items -rw--- 3 user hdfs 0 2009-03-19 22:28 /projects/myproject///.done [...@mymachine bin]$ echo $? 0 [...@mymachine bin]$ hadoop fs -test -e /projects/myproject///.done [...@mymachine bin]$ echo $? 1 What is the cause of such a behaviour, any pointers would much be appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly at env vars) Thanks Pankaj
Efficient algorithm for many-to-many reduce-side join?
I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can contain multiple records with any given key.

Every description of a reduce-side join I've seen involves constructing your keys in your mapper such that records from one dataset will be presented to the reducers before records from the second dataset. I should "hold on" to the value from the one dataset and remember it as I iterate across the values from the second dataset.

This seems like it only works well for one-to-many joins (when one of your datasets will only have a single record with any given key). This scales well because you're only remembering one value.

In a many-to-many join, if you apply this same algorithm, you'll need to remember all values from one dataset, which of course will be problematic (and won't scale) when dealing with large datasets with large numbers of records sharing the same keys.

Does an efficient algorithm exist for a many-to-many reduce-side join?
InputFormat for fixed-width records?
I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length, with the first field being the first 10 bytes, the second field being the next 10 bytes, etc. There are no newlines in the file. Field values have been either whitespace-padded or truncated to fit within their specific locations in these fixed-width records.

Does Hadoop have an InputFormat to support processing of such files? I looked but couldn't find one.

Of course, I could pre-process the file (outside of Hadoop) to put newlines at the end of each record, but I'd prefer not to require such a prep step.

Thanks.
Re: Issue with usage of fs -test
Thanks, Koji. This is the issue I am facing and I have been using version 0.18.x. -/Pankaj Koji Noguchi wrote: Maybe https://issues.apache.org/jira/browse/HADOOP-3792 ? Koji -Original Message- From: pankaj jairath [mailto:pjair...@yahoo-inc.com] Sent: Thursday, May 28, 2009 4:49 AM To: core-user@hadoop.apache.org Subject: Issue with usage of fs -test Hello, I am facing a strange issue, where in the /fs -test -e/ fails and /fs -ls/ succeeds to list the file. Following is the grep of such a result : bin]$ hadoop fs -ls /projects/myproject///.done Found 1 items -rw--- 3 user hdfs 0 2009-03-19 22:28 /projects/myproject///.done [...@mymachine bin]$ echo $? 0 [...@mymachine bin]$ hadoop fs -test -e /projects/myproject///.done [...@mymachine bin]$ echo $? 1 What is the cause of such a behaviour, any pointers would much be appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly at env vars) Thanks Pankaj
Re: InputFormat for fixed-width records?
Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat. Subclass FileInputFormat and provide an implementation of getRecordReader() that returns your implementation of RecordReader to read fixed width records. In the next() method you would do something like:

byte[] buf = new byte[100];
IOUtils.readFully(in, buf, pos, 100);
pos += 100;

You would also need to check for the end of the stream. See LineRecordReader for some ideas.

You'll also have to handle finding the start of records for a split, which you can do by looking at the offset and seeking to the next multiple of 100.

If the RecordReader was a RecordReader (no keys) then it would return each record as a byte array to the mapper, which would then break it into fields. Alternatively, you could do it in the RecordReader, and use your own type which encapsulates the fields for the value.

Hope this helps.

Cheers,
Tom

On Thu, May 28, 2009 at 1:15 PM, Stuart White wrote:
> I need to process a dataset that contains text records of fixed length
> in bytes. For example, each record may be 100 bytes in length, with
> the first field being the first 10 bytes, the second field being the
> second 10 bytes, etc... There are no newlines on the file. Field
> values have been either whitespace-padded or truncated to fit within
> the specific locations in these fixed-width records.
>
> Does Hadoop have an InputFormat to support processing of such files?
> I looked but couldn't find one.
>
> Of course, I could pre-process the file (outside of Hadoop) to put
> newlines at the end of each record, but I'd prefer not to require such
> a prep step.
>
> Thanks.
>
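As a minimal standalone sketch of the fixed-width record loop itself (plain java.io rather than a full RecordReader; the 100-byte records and 10-byte fields follow the example in the question, and the file comes from the command line):

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class FixedWidthReadSketch {
    public static void main(String[] args) throws IOException {
        int recordLen = 100;                         // assumed record length
        byte[] buf = new byte[recordLen];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            while (true) {
                try {
                    in.readFully(buf);               // read exactly one fixed-width record
                } catch (EOFException eof) {
                    break;                           // clean end of file
                }
                // First two 10-byte fields, trimmed of whitespace padding.
                String field1 = new String(buf, 0, 10, "US-ASCII").trim();
                String field2 = new String(buf, 10, 10, "US-ASCII").trim();
                System.out.println(field1 + "\t" + field2);
            }
        } finally {
            in.close();
        }
    }
}

A RecordReader's next() method would follow the same loop, plus the split-boundary handling described above.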
Re: Appending to a file / updating a file
append isn't supported without modifying the configuration file for hadoop. check out the mailling list threads ... i've sent a post in the past explaining how to enable it. On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja wrote: > Hello, > > I'm trying hadoop for the first time and I'm just trying to create a file > and append some text in it with the following code: > > > import java.io.IOException; > > import org.apache.hadoop.conf. Configuration; > import org.apache.hadoop.fs.FSDataOutputStream; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.Path; > > /** > * @author olivier > * > */ > public class HadoopIO { > > public static void main(String[] args) throws IOException { > > > String directory = "/Users/olivier/tmp/hadoop-data"; > Configuration conf = new Configuration(true); > Path path = new Path(directory); > // Create the File system > FileSystem fs = path.getFileSystem(conf); > // Sets the working directory > fs.setWorkingDirectory(path); > > System.out.println(fs.getWorkingDirectory()); > > // Creates a files > FSDataOutputStream out = fs.create(new Path("test.txt")); > out.writeBytes("Testing hadoop - first line"); > out.close(); > // then try to append something > out = fs.append(new Path("test.txt")); > out.writeBytes("Testing hadoop - second line"); > out.close(); > > fs.close(); > > > } > > } > > > but I receive the following exception: > > Exception in thread "main" java.io.IOException: Not supported > at > org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) > at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) > > > 1) Can someone tell me what am i doing wrong? > > 2) How can I update the file (for example, just update the first 10 bytes of > the file)? > > > Thanks, > Olivier > -- Sasha Dolgy sasha.do...@gmail.com
Re: Appending to a file / updating a file
Hi Sacha! Thanks for the quick answer. Is there a simple way to search the mailing list? by text or by author. At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see a browse per month... Thanks, Olivier On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy wrote: > append isn't supported without modifying the configuration file for > hadoop. check out the mailling list threads ... i've sent a post in > the past explaining how to enable it. > > On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja wrote: > > Hello, > > > > I'm trying hadoop for the first time and I'm just trying to create a file > > and append some text in it with the following code: > > > > > > import java.io.IOException; > > > > import org.apache.hadoop.conf. Configuration; > > import org.apache.hadoop.fs.FSDataOutputStream; > > import org.apache.hadoop.fs.FileSystem; > > import org.apache.hadoop.fs.Path; > > > > /** > > * @author olivier > > * > > */ > > public class HadoopIO { > > > >public static void main(String[] args) throws IOException { > > > > > >String directory = "/Users/olivier/tmp/hadoop-data"; > >Configuration conf = new Configuration(true); > >Path path = new Path(directory); > >// Create the File system > >FileSystem fs = path.getFileSystem(conf); > >// Sets the working directory > >fs.setWorkingDirectory(path); > > > >System.out.println(fs.getWorkingDirectory()); > > > >// Creates a files > >FSDataOutputStream out = fs.create(new Path("test.txt")); > >out.writeBytes("Testing hadoop - first line"); > >out.close(); > >// then try to append something > >out = fs.append(new Path("test.txt")); > >out.writeBytes("Testing hadoop - second line"); > >out.close(); > > > >fs.close(); > > > > > >} > > > > } > > > > > > but I receive the following exception: > > > > Exception in thread "main" java.io.IOException: Not supported > >at > > > org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290) > >at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) > >at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) > > > > > > 1) Can someone tell me what am i doing wrong? > > > > 2) How can I update the file (for example, just update the first 10 bytes > of > > the file)? > > > > > > Thanks, > > Olivier > > > > > > -- > Sasha Dolgy > sasha.do...@gmail.com >
Re: Appending to a file / updating a file
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja wrote: > Hi Sacha! > > Thanks for the quick answer. Is there a simple way to search the mailing > list? by text or by author. > > At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see a > browse per month... > > Thanks, > Olivier > > > On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy wrote: > >> append isn't supported without modifying the configuration file for >> hadoop. check out the mailling list threads ... i've sent a post in >> the past explaining how to enable it. >> >> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja wrote: >> > Hello, >> > >> > I'm trying hadoop for the first time and I'm just trying to create a file >> > and append some text in it with the following code: >> > >> > >> > import java.io.IOException; >> > >> > import org.apache.hadoop.conf. Configuration; >> > import org.apache.hadoop.fs.FSDataOutputStream; >> > import org.apache.hadoop.fs.FileSystem; >> > import org.apache.hadoop.fs.Path; >> > >> > /** >> > * @author olivier >> > * >> > */ >> > public class HadoopIO { >> > >> > public static void main(String[] args) throws IOException { >> > >> > >> > String directory = "/Users/olivier/tmp/hadoop-data"; >> > Configuration conf = new Configuration(true); >> > Path path = new Path(directory); >> > // Create the File system >> > FileSystem fs = path.getFileSystem(conf); >> > // Sets the working directory >> > fs.setWorkingDirectory(path); >> > >> > System.out.println(fs.getWorkingDirectory()); >> > >> > // Creates a files >> > FSDataOutputStream out = fs.create(new Path("test.txt")); >> > out.writeBytes("Testing hadoop - first line"); >> > out.close(); >> > // then try to append something >> > out = fs.append(new Path("test.txt")); >> > out.writeBytes("Testing hadoop - second line"); >> > out.close(); >> > >> > fs.close(); >> > >> > >> > } >> > >> > } >> > >> > >> > but I receive the following exception: >> > >> > Exception in thread "main" java.io.IOException: Not supported >> > at >> > >> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290) >> > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) >> > at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) >> > >> > >> > 1) Can someone tell me what am i doing wrong? >> > >> > 2) How can I update the file (for example, just update the first 10 bytes >> of >> > the file)? >> > >> > >> > Thanks, >> > Olivier >> > >> >> >> >> -- >> Sasha Dolgy >> sasha.do...@gmail.com >> > -- Sasha Dolgy sasha.do...@gmail.com
Appending to a file / updating a file
Hello,

I'm trying hadoop for the first time and I'm just trying to create a file and append some text to it with the following code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author olivier
 */
public class HadoopIO {

    public static void main(String[] args) throws IOException {

        String directory = "/Users/olivier/tmp/hadoop-data";
        Configuration conf = new Configuration(true);
        Path path = new Path(directory);
        // Create the file system
        FileSystem fs = path.getFileSystem(conf);
        // Set the working directory
        fs.setWorkingDirectory(path);

        System.out.println(fs.getWorkingDirectory());

        // Create a file
        FSDataOutputStream out = fs.create(new Path("test.txt"));
        out.writeBytes("Testing hadoop - first line");
        out.close();
        // then try to append something
        out = fs.append(new Path("test.txt"));
        out.writeBytes("Testing hadoop - second line");
        out.close();

        fs.close();
    }
}

but I receive the following exception:

Exception in thread "main" java.io.IOException: Not supported
    at org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
    at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)

1) Can someone tell me what I am doing wrong?

2) How can I update the file (for example, just update the first 10 bytes of the file)?

Thanks,
Olivier
Re: hadoop hardware configuration
On May 28, 2009, at 5:02 AM, Steve Loughran wrote: Patrick Angeles wrote: Sorry for cross-posting, I realized I sent the following to the hbase list when it's really more a Hadoop question. This is an interesting question. Obviously as an HP employee you must assume that I'm biased when I say HP DL160 servers are good value for the workers, though our blade systems are very good for a high physical density -provided you have the power to fill up the rack. :) As an HP employee, can you provide us with coupons? 2 x Hadoop Master (and Secondary NameNode) - 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W) - 16GB DDR2-800 Registered ECC Memory - 4 x 1TB 7200rpm SATA II Drives - Hardware RAID controller - Redundant Power Supply - Approx. 390W power draw (1.9amps 208V) - Approx. $4000 per unit I do not know the what the advantages of that many cores are on a NN. Someone needs to do some experiments. I do know you need enough RAM to hold the index in memory, and you may want to go for a bigger block size to keep the index size down. Despite my trying, I've never been able to come even close to pegging the CPUs on our NN. I'd recommend going for the fastest dual-cores which are affordable -- latency is king. Of course, with the size of your cluster, I'd spend a little less money here and get more disk space. 6 x Hadoop Task Nodes - 1 x 2.3Ghz Quad Core (Opteron 1356) - 8GB DDR2-800 Registered ECC Memory - 4 x 1TB 7200rpm SATA II Drives - No RAID (JBOD) - Non-Redundant Power Supply - Approx. 210W power draw (1.0amps 208V) - Approx. $2000 per unit I had some specific questions regarding this configuration... 1. Is hardware RAID necessary for the master node? You need a good story to ensure that loss of a disk on the master doesn't lose the filesystem. I like RAID there, but the alternative is to push the stuff out over the network to other storage you trust. That could be NFS-mounted RAID storage, it could be NFS mounted JBOD. Whatever your chosen design, test it works before you go live by running the cluster then simulate different failures, see how well the hardware/ops team handles it. Keep an eye on where that data goes, because when the NN runs out of file storage, the consequences can be pretty dramatic (i,e the cluster doesnt come up unless you edit the editlog by hand) We do both -- push the disk image out to NFS and have a mirrored SAS hard drives on the namenode. The SAS drives appear to be overkill. 2. What is a good processor-to-storage ratio for a task node with 4TB of raw storage? (The config above has 1 core per 1TB of raw storage.) We're data hungry locally -- I'd put in bigger hard drives. The 1.5TB Seagate drives seem to have passed their teething issues, and are at a pretty sweet price point. They only will scale up to 60 IOPS, so make sure your workflows don't have lots of random I/O. As Steve mentions below, the rest is really up to your algorithm. Do you need 1 CPU second / byte? If so, buy more CPUs. Do you need .1 CPU second / MB? If so, buy more disks. Brian That really depends on the work you are doing...the bytes in/out to CPU work, and the size of any memory structures that are built up over the run. With 1 core per physical disk, you get the bandwidth of a single disk per CPU; for some IO-intensive work you can make the case for two disks/CPU -one in, one out, but then you are using more power, and if/when you want to add more storage, you have to pull out the disks to stick in new ones. If you go for more CPUs, you will probably need more RAM to go with it. 3. 
Am I better off using dual quads for a task node, with a higher power draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly $3200, but draws almost 2x as much power. The tradeoffs are: 1. I will get more CPU per dollar and per watt. 2. I will only be able to fit 1/2 as much dual quad machines into a rack. 3. I will get 1/2 the storage capacity per watt. 4. I will get less I/O throughput overall (less spindles per core) First there is the algorithm itself, and whether you are IO or CPU bound. Most MR jobs that I've encountered are fairly IO bound - without indexes, every lookup has to stream through all the data, so it's power inefficient and IO limited. but if you are trying to do higher level stuff than just lookup, then you will be doing more CPU- work Then there is the question of where your electricity comes from, what the limits for the room are, whether you are billed on power drawn or quoted PSU draw, what the HVAC limits are, what the maximum allowed weight per rack is, etc, etc. I'm a fan of low Joule work, though we don't have any benchmarks yet of the power efficiency of different clusters; the number of MJ used to do a a terasort. I'm debating doing some single-cpu t
Re: SequenceFile and streaming
Hi Walter, On Thu, May 28, 2009 at 6:52 AM, walter steffe wrote: > Hello > I am a new user and I would like to use hadoop streaming with > SequenceFile in both input and output side. > > -The first difficoulty arises from the lack of a simple tool to generate > a SequenceFile starting from a set of files in a given directory. > I would like to have something similar to "tar -cvf file.tar foo/" > This should work also in the opposite direction like "tar -xvf file.tar" There's a tool for turning tar files into sequence files here: http://stuartsierra.com/2008/04/24/a-million-little-files > > -An other important feature that I would like to see is the possibility > to feed the mapper stdin with the whole content of a file (extracted > from the file SequenceFile) disregarding the key. Have a look at SequenceFileAsTextInputFormat which will do this for you (except the key is the sequence file's key). > Using each file as a tar archive I it would like to be able to do: > > $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ > -input "/user/me/inputSequenceFile" \ > -output "/user/me/outputSequenceFile" \ > -inputformat SequenceFile > -outputformat SequenceFile > -mapper myscript.sh > -reducer NONE > > myscrip.sh should work as a filter which takes its input from > stdin and put the output on stdout: > > tar -x > "do something on the generated dir and create an outputfile" > cat outputfile > > The output file should (automatically) go into the outputSequenceFile. > > I think that this would be a very usefull schema which fits well with > the mapreduce requirements on one side and with the unix commands on the > other side. It should not be too difficoult to implement the tools > needed for that. I totally agree - having more tools to better integrate sequence files and map files with unix tools would be very handy. Tom > > > Walter > > > > > > > >
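Along the same lines, a minimal sketch of a tar-style packing step: it writes every file in a local directory into a SequenceFile, with the file name as the key and the raw file contents as the value (the argument handling and the key/value choice are illustrative assumptions, not part of any existing Hadoop tool):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirToSequenceFile {
    // Usage (assumed): DirToSequenceFile <localDir> <outputSequenceFile>
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[1]);                       // e.g. /user/me/inputSequenceFile
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (File f : new File(args[0]).listFiles()) {  // e.g. foo/
                if (!f.isFile()) {
                    continue;                               // skip subdirectories in this sketch
                }
                byte[] contents = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                } finally {
                    in.close();
                }
                // key = file name, value = whole file contents
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

A job (or a streaming job using SequenceFileAsTextInputFormat, as mentioned above) can then read the packed file back as one record per original file.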
Re: Appending to a file / updating a file
Thanks Sacha, I have now my hdfs-site.xml like that : (as the hadoop-site.xml seems to be deprecated) dfs.support.append true But I continue receiving the exception. Checking the hadoop source code, I saw public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException { throw new IOException("Not supported"); } in the class org.apache.hadoop.fs.ChecksumFileSystem and this is where the exception is thrown. my fileSystem is a LocalFileSystem instance that inherits the ChecksumFileSystem and the append method has not beem overriden. Whereas in the DistributedFileSystem, the append method is defined like this: /** This optional operation is not yet supported. */ public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException { DFSOutputStream op = (DFSOutputStream)dfs.append(getPathName(f), bufferSize, progress); return new FSDataOutputStream(op, statistics, op.getInitialLen()); } even if the comment still say it is not supported, it seems to do something So this makes me think that append is not supported on hadoop LocalFileSystem. Is it correct? Thanks, Olivier On Thu, May 28, 2009 at 11:06 AM, Sasha Dolgy wrote: > http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html > > > On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja wrote: > > Hi Sacha! > > > > Thanks for the quick answer. Is there a simple way to search the mailing > > list? by text or by author. > > > > At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see > a > > browse per month... > > > > Thanks, > > Olivier > > > > > > On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy wrote: > > > >> append isn't supported without modifying the configuration file for > >> hadoop. check out the mailling list threads ... i've sent a post in > >> the past explaining how to enable it. > >> > >> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja > wrote: > >> > Hello, > >> > > >> > I'm trying hadoop for the first time and I'm just trying to create a > file > >> > and append some text in it with the following code: > >> > > >> > > >> > import java.io.IOException; > >> > > >> > import org.apache.hadoop.conf. 
Configuration; > >> > import org.apache.hadoop.fs.FSDataOutputStream; > >> > import org.apache.hadoop.fs.FileSystem; > >> > import org.apache.hadoop.fs.Path; > >> > > >> > /** > >> > * @author olivier > >> > * > >> > */ > >> > public class HadoopIO { > >> > > >> >public static void main(String[] args) throws IOException { > >> > > >> > > >> >String directory = "/Users/olivier/tmp/hadoop-data"; > >> >Configuration conf = new Configuration(true); > >> >Path path = new Path(directory); > >> >// Create the File system > >> >FileSystem fs = path.getFileSystem(conf); > >> >// Sets the working directory > >> >fs.setWorkingDirectory(path); > >> > > >> >System.out.println(fs.getWorkingDirectory()); > >> > > >> >// Creates a files > >> >FSDataOutputStream out = fs.create(new Path("test.txt")); > >> >out.writeBytes("Testing hadoop - first line"); > >> >out.close(); > >> >// then try to append something > >> >out = fs.append(new Path("test.txt")); > >> >out.writeBytes("Testing hadoop - second line"); > >> >out.close(); > >> > > >> >fs.close(); > >> > > >> > > >> >} > >> > > >> > } > >> > > >> > > >> > but I receive the following exception: > >> > > >> > Exception in thread "main" java.io.IOException: Not supported > >> >at > >> > > >> > org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290) > >> >at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) > >> >at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) > >> > > >> > > >> > 1) Can someone tell me what am i doing wrong? > >> > > >> > 2) How can I update the file (for example, just update the first 10 > bytes > >> of > >> > the file)? > >> > > >> > > >> > Thanks, > >> > Olivier > >> > > >> > >> > >> > >> -- > >> Sasha Dolgy > >> sasha.do...@gmail.com > >> > > > > > > -- > Sasha Dolgy > sasha.do...@gmail.com >
Re: InputFormat for fixed-width records?
On May 28, 2009, at 5:15 AM, Stuart White wrote: I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length The update to the terasort example has an InputFormat that does exactly that. The key is 10 bytes and the value is the next 90 bytes. It is pretty easy to write, but I should upload it soon. The output types are Text, but they just have the binary data in them. -- Owen
Re: Appending to a file / updating a file
did you restart hadoop? sorry i'm stuck in the middle of something so can't give this more attention. i can assure you however that we have append working in our POC ... and the code isn't that much different to what you have posted. -sd On Thu, May 28, 2009 at 3:31 PM, Olivier Smadja wrote: > Thanks Sacha, > > I have now my hdfs-site.xml like that : (as the hadoop-site.xml seems to be > deprecated) > > > > dfs.support.append > true > > > > But I continue receiving the exception. > > Checking the hadoop source code, I saw > > public FSDataOutputStream append(Path f, int bufferSize, > Progressable progress) throws IOException { > throw new IOException("Not supported"); > } > > in the class org.apache.hadoop.fs.ChecksumFileSystem and this is where the > exception is thrown. my fileSystem is a LocalFileSystem instance that > inherits the ChecksumFileSystem and the append method has not beem > overriden. > > Whereas in the DistributedFileSystem, the append method is defined like > this: > > /** This optional operation is not yet supported. */ > public FSDataOutputStream append(Path f, int bufferSize, > Progressable progress) throws IOException { > > DFSOutputStream op = (DFSOutputStream)dfs.append(getPathName(f), > bufferSize, progress); > return new FSDataOutputStream(op, statistics, op.getInitialLen()); > } > > even if the comment still say it is not supported, it seems to do > something > > So this makes me think that append is not supported on hadoop > LocalFileSystem. > > Is it correct? > > Thanks, > Olivier > > > > > > On Thu, May 28, 2009 at 11:06 AM, Sasha Dolgy wrote: > >> http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html >> >> >> On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja wrote: >> > Hi Sacha! >> > >> > Thanks for the quick answer. Is there a simple way to search the mailing >> > list? by text or by author. >> > >> > At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see >> a >> > browse per month... >> > >> > Thanks, >> > Olivier >> > >> > >> > On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy wrote: >> > >> >> append isn't supported without modifying the configuration file for >> >> hadoop. check out the mailling list threads ... i've sent a post in >> >> the past explaining how to enable it. >> >> >> >> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja >> wrote: >> >> > Hello, >> >> > >> >> > I'm trying hadoop for the first time and I'm just trying to create a >> file >> >> > and append some text in it with the following code: >> >> > >> >> > >> >> > import java.io.IOException; >> >> > >> >> > import org.apache.hadoop.conf. 
Configuration; >> >> > import org.apache.hadoop.fs.FSDataOutputStream; >> >> > import org.apache.hadoop.fs.FileSystem; >> >> > import org.apache.hadoop.fs.Path; >> >> > >> >> > /** >> >> > * @author olivier >> >> > * >> >> > */ >> >> > public class HadoopIO { >> >> > >> >> > public static void main(String[] args) throws IOException { >> >> > >> >> > >> >> > String directory = "/Users/olivier/tmp/hadoop-data"; >> >> > Configuration conf = new Configuration(true); >> >> > Path path = new Path(directory); >> >> > // Create the File system >> >> > FileSystem fs = path.getFileSystem(conf); >> >> > // Sets the working directory >> >> > fs.setWorkingDirectory(path); >> >> > >> >> > System.out.println(fs.getWorkingDirectory()); >> >> > >> >> > // Creates a files >> >> > FSDataOutputStream out = fs.create(new Path("test.txt")); >> >> > out.writeBytes("Testing hadoop - first line"); >> >> > out.close(); >> >> > // then try to append something >> >> > out = fs.append(new Path("test.txt")); >> >> > out.writeBytes("Testing hadoop - second line"); >> >> > out.close(); >> >> > >> >> > fs.close(); >> >> > >> >> > >> >> > } >> >> > >> >> > } >> >> > >> >> > >> >> > but I receive the following exception: >> >> > >> >> > Exception in thread "main" java.io.IOException: Not supported >> >> > at >> >> > >> >> >> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290) >> >> > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) >> >> > at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) >> >> > >> >> > >> >> > 1) Can someone tell me what am i doing wrong? >> >> > >> >> > 2) How can I update the file (for example, just update the first 10 >> bytes >> >> of >> >> > the file)? >> >> > >> >> > >> >> > Thanks, >> >> > Olivier >> >> > >> >> >> >> >> >> >> >> -- >> >> Sasha Dolgy >> >> sasha.do...@gmail.com >> >> >> > >> >> >> >> -- >> Sasha Dolgy >> sasha.do...@gmail.com >> > -- Sasha Dolgy sasha.do...@gmail.com
Re: Efficient algorithm for many-to-many reduce-side join?
Hi Stuart, It seems to me like you have a few options. Option 1: Just use a lot of RAM. Unless you really expect many millions of entries on both sides of the join, you might be able to get away with buffering despite its inefficiency. Option 2: Use LocalDirAllocator to find some local storage to spill all of the left table's records to disk in a MapFile format. Then as you iterate over the right table, do lookups in the MapFile. This is really the same as option 1, except that you're using disk as an extension of RAM. Option 3: Convert this to a map-side merge join. Basically what you need to do is sort both tables by the join key, and partition them with the same partitioner into the same number of columns. This way you have an equal number of part-N files for both tables, and within each part-N file they're ordered by join key. In each map task, you open both tableA/part-N and tableB/part-N and do a sequential merge to perform the join. I believe the CompositeInputFormat class helps with this, though I've never used it. Option 4: Perform the join in several passes. Whichever table is smaller, break into pieces that fit in RAM. Unless my relational algebra is off, A JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 UNION B2. Hope that helps -Todd On Thu, May 28, 2009 at 5:02 AM, Stuart White wrote: > I need to do a reduce-side join of two datasets. It's a many-to-many > join; that is, each dataset can can multiple records with any given > key. > > Every description of a reduce-side join I've seen involves > constructing your keys out of your mapper such that records from one > dataset will be presented to the reducers before records from the > second dataset. I should "hold on" to the value from the one dataset > and remember it as I iterate across the values from the second > dataset. > > This seems like it only works well for one-to-many joins (when one of > your datasets will only have a single record with any given key). > This scales well because you're only remembering one value. > > In a many-to-many join, if you apply this same algorithm, you'll need > to remember all values from one dataset, which of course will be > problematic (and won't scale) when dealing with large datasets with > large numbers of records with the same keys. > > Does an efficient algorithm exist for a many-to-many reduce-side join? >
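For what it's worth, a minimal sketch of Option 1 (buffering one side in RAM) in the old mapred API. It assumes the mappers tag each value with an "A|" or "B|" prefix and that the sort delivers the A-side records first for each key - both assumptions about the surrounding job, not code from this thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BufferingJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Buffer all A-side records for this key (only works while they fit in RAM).
        List<String> aSide = new ArrayList<String>();
        while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("A|")) {
                aSide.add(v.substring(2));
            } else if (v.startsWith("B|")) {
                // Join this B record against every buffered A record.
                String b = v.substring(2);
                for (String a : aSide) {
                    output.collect(key, new Text(a + "\t" + b));
                }
            }
        }
    }
}

As noted above, this only scales as far as the A side for a given key fits in memory.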
Re: Efficient algorithm for many-to-many reduce-side join?
One last possible trick to consider: If you were to subclass SequenceFileRecordReader, you'd have access to its seek method, allowing you to rewind the reducer input. You could then implement a block hash join with something like the following pseudocode: ahash = new HashMap(); while (i have ram available) { read a record if the record is from table B, break put the record into ahash } nextAPos = reader.getPos() while (current record is an A record) { skip to next record } firstBPos = reader.getPos() while (current record has current key) { read and join against ahash process joined result } if firstBPos > nextAPos { seek(nextAPos) go back to top } On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon wrote: > Hi Stuart, > > It seems to me like you have a few options. > > Option 1: Just use a lot of RAM. Unless you really expect many millions of > entries on both sides of the join, you might be able to get away with > buffering despite its inefficiency. > > Option 2: Use LocalDirAllocator to find some local storage to spill all of > the left table's records to disk in a MapFile format. Then as you iterate > over the right table, do lookups in the MapFile. This is really the same as > option 1, except that you're using disk as an extension of RAM. > > Option 3: Convert this to a map-side merge join. Basically what you need to > do is sort both tables by the join key, and partition them with the same > partitioner into the same number of columns. This way you have an equal > number of part-N files for both tables, and within each part-N file > they're ordered by join key. In each map task, you open both tableA/part-N > and tableB/part-N and do a sequential merge to perform the join. I believe > the CompositeInputFormat class helps with this, though I've never used it. > > Option 4: Perform the join in several passes. Whichever table is smaller, > break into pieces that fit in RAM. Unless my relational algebra is off, A > JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 UNION > B2. > > Hope that helps > -Todd > > > On Thu, May 28, 2009 at 5:02 AM, Stuart White wrote: > >> I need to do a reduce-side join of two datasets. It's a many-to-many >> join; that is, each dataset can can multiple records with any given >> key. >> >> Every description of a reduce-side join I've seen involves >> constructing your keys out of your mapper such that records from one >> dataset will be presented to the reducers before records from the >> second dataset. I should "hold on" to the value from the one dataset >> and remember it as I iterate across the values from the second >> dataset. >> >> This seems like it only works well for one-to-many joins (when one of >> your datasets will only have a single record with any given key). >> This scales well because you're only remembering one value. >> >> In a many-to-many join, if you apply this same algorithm, you'll need >> to remember all values from one dataset, which of course will be >> problematic (and won't scale) when dealing with large datasets with >> large numbers of records with the same keys. >> >> Does an efficient algorithm exist for a many-to-many reduce-side join? >> > >
Re: Efficient algorithm for many-to-many reduce-side join?
I believe PIG, and I know Cascading use a kind of 'spillable' list that can be re-iterated across. PIG's version is a bit more sophisticated last I looked. that said, if you were using either one of them, you wouldn't need to write your own many-to-many join. cheers, ckw On May 28, 2009, at 8:14 AM, Todd Lipcon wrote: One last possible trick to consider: If you were to subclass SequenceFileRecordReader, you'd have access to its seek method, allowing you to rewind the reducer input. You could then implement a block hash join with something like the following pseudocode: ahash = new HashMap(); while (i have ram available) { read a record if the record is from table B, break put the record into ahash } nextAPos = reader.getPos() while (current record is an A record) { skip to next record } firstBPos = reader.getPos() while (current record has current key) { read and join against ahash process joined result } if firstBPos > nextAPos { seek(nextAPos) go back to top } On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon wrote: Hi Stuart, It seems to me like you have a few options. Option 1: Just use a lot of RAM. Unless you really expect many millions of entries on both sides of the join, you might be able to get away with buffering despite its inefficiency. Option 2: Use LocalDirAllocator to find some local storage to spill all of the left table's records to disk in a MapFile format. Then as you iterate over the right table, do lookups in the MapFile. This is really the same as option 1, except that you're using disk as an extension of RAM. Option 3: Convert this to a map-side merge join. Basically what you need to do is sort both tables by the join key, and partition them with the same partitioner into the same number of columns. This way you have an equal number of part-N files for both tables, and within each part- N file they're ordered by join key. In each map task, you open both tableA/ part-N and tableB/part-N and do a sequential merge to perform the join. I believe the CompositeInputFormat class helps with this, though I've never used it. Option 4: Perform the join in several passes. Whichever table is smaller, break into pieces that fit in RAM. Unless my relational algebra is off, A JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 UNION B2. Hope that helps -Todd On Thu, May 28, 2009 at 5:02 AM, Stuart White >wrote: I need to do a reduce-side join of two datasets. It's a many-to- many join; that is, each dataset can can multiple records with any given key. Every description of a reduce-side join I've seen involves constructing your keys out of your mapper such that records from one dataset will be presented to the reducers before records from the second dataset. I should "hold on" to the value from the one dataset and remember it as I iterate across the values from the second dataset. This seems like it only works well for one-to-many joins (when one of your datasets will only have a single record with any given key). This scales well because you're only remembering one value. In a many-to-many join, if you apply this same algorithm, you'll need to remember all values from one dataset, which of course will be problematic (and won't scale) when dealing with large datasets with large numbers of records with the same keys. Does an efficient algorithm exist for a many-to-many reduce-side join? -- Chris K Wensel ch...@concurrentinc.com http://www.concurrentinc.com
Re: hadoop hardware configuration
Brian Bockelman writes: > Despite my trying, I've never been able to come even close to pegging > the CPUs on our NN. > > I'd recommend going for the fastest dual-cores which are affordable -- > latency is king. Clue? Surely the latencies in Hadoop that dominate are not cured with faster processors, but with more RAM and faster disks? I've followed your posts for a while, so I know you are very experienced with this stuff... help me out here. Ian
Re: hadoop hardware configuration
On May 28, 2009, at 10:32 AM, Ian Soboroff wrote: Brian Bockelman writes: Despite my trying, I've never been able to come even close to pegging the CPUs on our NN. I'd recommend going for the fastest dual-cores which are affordable -- latency is king. Clue? Surely the latencies in Hadoop that dominate are not cured with faster processors, but with more RAM and faster disks? I've followed your posts for a while, so I know you are very experienced with this stuff... help me out here. Actually, that's more of a gut feeling than informed decision. Because the locking is rather coarse-grained, having many CPUs isn't going to win anything -- I'd rather any CPU-related portions to go as fast as possible. Under the highest load, I think we've been able to get up to 25% CPU utilization: thus, I'm guessing any CPU-related improvements will come from faster ones, not more cores. For my cluster, if I had a lot of money, I'd spend it on a hot-spare machine. Then, I'd spend it on upgrading the RAM, followed by disks, followed by CPU. Then again, for the cluster in the original email, I'd save money on the namenode and buy more datanodes. We've got about 200 nodes and probably have a comparable NN. Brian
Re: Appending to a file / updating a file
Olivier, Append is not supported or recommended at this point. You can turn it on via dfs.support.append in hdfs-site.xml under 0.20.0. There have been some issues making it reliable. If this is not production code or a production job then turning it on will probably have no detrimental effect, but be aware it might destroy your data as it is not recommended at this point. Regards Damien On 28/05/2009, at 6:46 AM, Olivier Smadja wrote: Hello, I'm trying hadoop for the first time and I'm just trying to create a file and append some text in it with the following code: import java.io.IOException; import org.apache.hadoop.conf. Configuration; import org.apache.hadoop.fs.FSDataOutputStream; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; /** * @author olivier * */ public class HadoopIO { public static void main(String[] args) throws IOException { String directory = "/Users/olivier/tmp/hadoop-data"; Configuration conf = new Configuration(true); Path path = new Path(directory); // Create the File system FileSystem fs = path.getFileSystem(conf); // Sets the working directory fs.setWorkingDirectory(path); System.out.println(fs.getWorkingDirectory()); // Creates a files FSDataOutputStream out = fs.create(new Path("test.txt")); out.writeBytes("Testing hadoop - first line"); out.close(); // then try to append something out = fs.append(new Path("test.txt")); out.writeBytes("Testing hadoop - second line"); out.close(); fs.close(); } } but I receive the following exception: Exception in thread "main" java.io.IOException: Not supported at org .apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java: 290) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) 1) Can someone tell me what am i doing wrong? 2) How can I update the file (for example, just update the first 10 bytes of the file)? Thanks, Olivier Damien Cooke Open Scalable Solutions Performance Performance & Applications Engineering Sun Microsystems Level 2, East Wing 50 Grenfell Street, Adelaide SA 5000 Australia Phone x58315 (x7058315 US callers) Email damien.co...@sun.com
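For reference, a minimal sketch of the hdfs-site.xml property Damien and Sasha refer to (again: this applies to HDFS only, and append is not considered reliable at this point):

<configuration>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
</configuration>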
Reduce() time takes ~4x Map()
Hi everyone,

I'm processing XML files, around 500MB each with several documents. To the map() function I pass a document from the XML file, which takes some time to process depending on its size - I'm applying NER to the texts. Each document has a unique identifier, so I'm using that identifier as the key and the result of parsing the document, as one string, as the output. So at the end of the map() function:

output.collect(new Text(identifier), new Text(outputString));

Usually the outputString is around 1k-5k in size.

reduce():

public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) {
    while (values.hasNext()) {
        Text text = values.next();
        try {
            output.collect(key, text);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

I did a test using only 1 machine with 8 cores and only 1 XML file; it took around 3 hours to process all the maps and ~12 hours for the reduces! The XML file has 139,945 documents. I set the jobconf for 1000 maps and 200 reduces.

I took a look at the graphs on the web interface during the reduce phase, and indeed it's the copy phase that's taking most of the time; the sort and reduce phases are done almost instantly.

Why does the copy phase take so long? I understand that the copies are made using HTTP, and the data was in really small chunks of 1k-5k size, but even so, with everything on the same physical machine it should have been faster, no?

Any suggestions on what might be causing the copies in the reduce phase to take so long?

--
./david
org.apache.hadoop.ipc.client : trying connect to server failed
Hi,

I am trying to set up a Hadoop cluster on 512MB machines, using Hadoop 0.18, and I have followed the procedure given on the Apache Hadoop site for a Hadoop cluster. In conf/slaves I included two datanodes, i.e. one on the same namenode virtual machine and one on the other machine's virtual machine, and I have set up passwordless ssh between both virtual machines. But now the problem is that when I run the command

bin/hadoop start-all.sh

it starts only one datanode, on the same namenode virtual machine, but it doesn't start the datanode on the other machine. In logs/hadoop-datanode I get the message:

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 1 time(s).
2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s).
2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s).
...

So can anyone help in solving this problem? :)

Thanks
Regards
Ashish Pareek
Re: Persistent storage on EC2
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka < mmata...@millennialmedia.com> wrote: > I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep > the master updated on how to map the internal IPs, which change as I > understand, to a known set of host names so it knows where the blocks are > located each time I bring a cluster up? If so, is keeping a mapping up to > date in /etc/hosts sufficient? > I can't answer your first question of whether it's necessary. The namenode might be able to figure it out when the DNs report their blocks. Our staging cluster uses the setup you describe, with /etc/hosts pushed out to all the machines, and the EBS volumes always mounted on the same hostname. This works great.
Re: hadoop hardware configuration
On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote: > > We do both -- push the disk image out to NFS and have a mirrored SAS hard > drives on the namenode. The SAS drives appear to be overkill. > This sounds like a nice approach, taking into account hardware, labor and downtime costs... $700 for a RAID controller seems reasonable to minimize maintenance due to a disk failure. Alex's suggestion to go JBOD and write to all volumes would work as well, but slightly more labor intensive. >> 2. What is a good processor-to-storage ratio for a task node with 4TB of >>> raw storage? (The config above has 1 core per 1TB of raw storage.) >>> >> > > We're data hungry locally -- I'd put in bigger hard drives. The 1.5TB > Seagate drives seem to have passed their teething issues, and are at a > pretty sweet price point. They only will scale up to 60 IOPS, so make sure > your workflows don't have lots of random I/O. > I haven't seen too many vendors offering the 1.5TB option. What type of data are you working with? At what volumes? I sense that at 50GB/day, we are higher than average in terms of data volume over time. > As Steve mentions below, the rest is really up to your algorithm. Do you > need 1 CPU second / byte? If so, buy more CPUs. Do you need .1 CPU second > / MB? If so, buy more disks. > Unfortunately, we won't know until we have a cluster to test on. Classic catch-22. We are going to experiment with a small cluster and a small data set, with plans to buy more appropriately sized slave nodes based on what we learn. - P
Re: hadoop hardware configuration
On May 28, 2009, at 2:00 PM, Patrick Angeles wrote: On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman >wrote: We do both -- push the disk image out to NFS and have a mirrored SAS hard drives on the namenode. The SAS drives appear to be overkill. This sounds like a nice approach, taking into account hardware, labor and downtime costs... $700 for a RAID controller seems reasonable to minimize maintenance due to a disk failure. Alex's suggestion to go JBOD and write to all volumes would work as well, but slightly more labor intensive. Remember though that disk failure downtime is actually rather rare. The question is "how tight is your hardware budget": if $700 is worth the extra 1 day of uptime a year, then spend it. I come from an academic background where (a) we don't lose money if things go down and (b) jobs move to another site in the US if things are down. That perhaps gives you a reading into my somewhat relaxed attitude. I'm not a hardware guy anymore, but I'd personally prefer a software RAID. I've seen mirrored disks go down because the RAID controller decided to puke. 2. What is a good processor-to-storage ratio for a task node with 4TB of raw storage? (The config above has 1 core per 1TB of raw storage.) We're data hungry locally -- I'd put in bigger hard drives. The 1.5TB Seagate drives seem to have passed their teething issues, and are at a pretty sweet price point. They only will scale up to 60 IOPS, so make sure your workflows don't have lots of random I/O. I haven't seen too many vendors offering the 1.5TB option. What type of data are you working with? At what volumes? I sense that at 50GB/day, we are higher than average in terms of data volume over time. We have just short of 300TB of raw disk; our daily downloads range from a few GB to 10TB. We bought 1.5TB drives separately from the nodes and sent students with screwdrivers at the cluster. As Steve mentions below, the rest is really up to your algorithm. Do you need 1 CPU second / byte? If so, buy more CPUs. Do you need .1 CPU second / MB? If so, buy more disks. Unfortunately, we won't know until we have a cluster to test on. Classic catch-22. We are going to experiment with a small cluster and a small data set, with plans to buy more appropriately sized slave nodes based on what we learn. In that case, you're probably good! 24TB probably formats out to 20TB. With 2x replication at 50GB a day, you've got enough room for about half a year of data. Hope your procurement process isn't too slow! Brian
Re: hadoop hardware configuration
On Thu, May 28, 2009 at 6:02 AM, Steve Loughran wrote: > That really depends on the work you are doing...the bytes in/out to CPU > work, and the size of any memory structures that are built up over the run. > > With 1 core per physical disk, you get the bandwidth of a single disk per > CPU; for some IO-intensive work you can make the case for two disks/CPU -one > in, one out, but then you are using more power, and if/when you want to add > more storage, you have to pull out the disks to stick in new ones. If you go > for more CPUs, you will probably need more RAM to go with it. > Just to throw a wrench in the works, Intel's Nehalem architecture takes DDR3 memory which are paired in 3's. So for a dual quad core rig, you can get either 6 x 2GB (12GB) or, 6 x 4GB (24GB) for an extra $500. That's a big step up in price for extra memory in a slave node. 12GB probably won't be enough, because the mid-range Nehalems support hyper-threading, so you actually get up to 16 threads running on a dual quad setup. > Then there is the question of where your electricity comes from, what the > limits for the room are, whether you are billed on power drawn or quoted PSU > draw, what the HVAC limits are, what the maximum allowed weight per rack is, > etc, etc. We're going to start with cabinets in a co-location. Most can provide 40amps per cabinet (with up to 80% load), so you could fit around 30 single-socket servers, or 15 dual-socket servers in a single rack. > > I'm a fan of low Joule work, though we don't have any benchmarks yet of the > power efficiency of different clusters; the number of MJ used to do a a > terasort. I'm debating doing some single-cpu tests for this on my laptop, as > the battery knows how much gets used up by some work. > >4. In planning storage capacity, how much spare disk space should I take >> into account for 'scratch'? For now, I'm assuming 1x the input data >> size. >> > > That you should probably be able to determine on experimental work on > smaller datasets. Some maps can throw out a lot of data, most reduces do > actually reduce the final amount. > > > -Steve > > (Disclaimer: I'm not making any official recommendations for hardware here, > just making my opinions known. If you do want an official recommendation > from HP, talk to your reseller or account manager, someone will look at your > problem in more detail and make some suggestions. If you have any code/data > that could be shared for benchmarking, that would help validate those > suggestions) > >
Re: Appending to a file / updating a file
Thanks Damien. And can i update a file with hadoop or just create it and read it later? Olivier On Thu, May 28, 2009 at 1:31 PM, Damien Cooke wrote: > Olivier, > Append is not supported or recommended at this point. You can turn it on > via dfs.support.append in hdfs-site.xml under 0.20.0. There have been some > issues making it reliable. If this is not production code or a production > job then turning it on will probably have no detrimental effect, but be > aware it might destroy your data as it is not recommended at this point. > > Regards > Damien > > On 28/05/2009, at 6:46 AM, Olivier Smadja wrote: > > Hello, >> >> I'm trying hadoop for the first time and I'm just trying to create a file >> and append some text in it with the following code: >> >> >> import java.io.IOException; >> >> import org.apache.hadoop.conf. Configuration; >> import org.apache.hadoop.fs.FSDataOutputStream; >> import org.apache.hadoop.fs.FileSystem; >> import org.apache.hadoop.fs.Path; >> >> /** >> * @author olivier >> * >> */ >> public class HadoopIO { >> >> public static void main(String[] args) throws IOException { >> >> >> String directory = "/Users/olivier/tmp/hadoop-data"; >> Configuration conf = new Configuration(true); >> Path path = new Path(directory); >> // Create the File system >> FileSystem fs = path.getFileSystem(conf); >> // Sets the working directory >> fs.setWorkingDirectory(path); >> >> System.out.println(fs.getWorkingDirectory()); >> >> // Creates a files >> FSDataOutputStream out = fs.create(new Path("test.txt")); >> out.writeBytes("Testing hadoop - first line"); >> out.close(); >> // then try to append something >> out = fs.append(new Path("test.txt")); >> out.writeBytes("Testing hadoop - second line"); >> out.close(); >> >> fs.close(); >> >> >> } >> >> } >> >> >> but I receive the following exception: >> >> Exception in thread "main" java.io.IOException: Not supported >> at >> >> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290) >> at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525) >> at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38) >> >> >> 1) Can someone tell me what am i doing wrong? >> >> 2) How can I update the file (for example, just update the first 10 bytes >> of >> the file)? >> >> >> Thanks, >> Olivier >> > > > Damien Cooke > Open Scalable Solutions Performance > Performance & Applications Engineering > > Sun Microsystems > Level 2, East Wing 50 Grenfell Street, Adelaide > SA 5000 Australia > Phone x58315 (x7058315 US callers) > Email damien.co...@sun.com > >
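Since append() is either unsupported (as in the ChecksumFileSystem stack trace above) or discouraged, one workaround is to rewrite the file: copy the existing contents plus the new bytes to a temporary path and then swap it into place. A minimal sketch of that idea (the class name and the .tmp suffix are just illustrative, not part of any Hadoop API):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RewriteAppend {
  // Rewrites 'file' with 'extra' added at the end, using only create/open/rename,
  // so it works on filesystems that do not support append().
  public static void append(FileSystem fs, Path file, byte[] extra, Configuration conf)
      throws IOException {
    Path tmp = new Path(file.getParent(), file.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmp);
    if (fs.exists(file)) {
      FSDataInputStream in = fs.open(file);
      IOUtils.copyBytes(in, out, conf, false); // copy the existing contents
      in.close();
    }
    out.write(extra);                          // then the new data
    out.close();
    fs.delete(file, true);                     // replace the old file
    fs.rename(tmp, file);
  }
}

The same rewrite approach answers question 2 as well: the FileSystem API has no in-place update, so changing even the first 10 bytes of a file means writing out a new copy with those bytes changed.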
How do I convert DataInput and ResultSet to array of String?
Hi, How do I convert DataInput to array of String? How do I convert ResultSet to array of String? Thanks. Following is the code: static class Record implements Writable, DBWritable { String [] aSAssoc; public void write(DataOutput arg0) throws IOException { throw new UnsupportedOperationException("Not supported yet."); } public void readFields(DataInput in) throws IOException { this.aSAssoc = // How to convert DataInput to String Array? } public void write(PreparedStatement arg0) throws SQLException { throw new UnsupportedOperationException("Not supported yet."); } public void readFields(ResultSet rs) throws SQLException { this.aSAssoc = // How to convert ResultSet to String Array? } } -- View this message in context: http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23770747.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
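One possible shape for those two method bodies, assuming write() emits the array length followed by one UTF string per element, and that the database column holds a comma-separated list (both are assumptions about your data, not something Writable or DBWritable dictates):

public void write(DataOutput out) throws IOException {
  // Mirror of readFields(DataInput): length, then one UTF string per element.
  out.writeInt(aSAssoc.length);
  for (String s : aSAssoc) {
    out.writeUTF(s);
  }
}

public void readFields(DataInput in) throws IOException {
  int n = in.readInt();
  aSAssoc = new String[n];
  for (int i = 0; i < n; i++) {
    aSAssoc[i] = in.readUTF();
  }
}

public void readFields(ResultSet rs) throws SQLException {
  // Assumes the first column is a comma-separated list; if each element
  // is its own column, call rs.getString(i) once per column instead.
  String raw = rs.getString(1);
  aSAssoc = (raw == null) ? new String[0] : raw.split(",");
}

These would replace the corresponding bodies in the Record class above.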
New version/API stable?
Hadoop noob here, just starting to learn it, as we're planning to start using it heavily in our processing. Just wondering, though, which version of the code I should start learning/working with. It looks like the Hadoop API changed pretty significantly from 0.19 to 0.20 (e.g., org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce), which leads me to think I should start with the new API. OTOH, since the new release is a ".0" release after some of these major API overhauls, I'm wondering if it's stable enough for us to start using in production. Where'd be best for me to start learning? TIA, DR
Question: index package in contrib (lucene index)
Hi, I am trying to understand the code of the index package to build a distributed Lucene index. I have some very basic questions and would really appreciate it if someone could help me understand this code: 1) If I already have a Lucene index (divided into shards), should I upload these indexes into HDFS and provide their location, or will the code pick up these shards from the local file system? 2) How does the code add a document to the Lucene index? I can see there is an index selection policy. Assuming the round-robin policy is chosen, how does the code add a document to the Lucene index? This is related to the first question - is the index where the new document is to be added in HDFS or on the local file system? I read in the README that the index is first created on the local file system, then copied back to HDFS. Can someone please point me to the code that does this? 3) After the MapReduce job finishes, where are the final indexes? In HDFS? 4) Correct me if I am wrong - the code builds multiple indexes, where each index is an instance of a Lucene index holding a disjoint subset of documents from the corpus. So, if I have to search for a term, I have to search each index and then merge the results. If this is correct, then how is the IDF of a term, which is a global statistic, computed and updated in each index? I mean, each index can compute the IDF with respect to the subset of documents that it has, but it cannot compute the global IDF of a term (since it knows nothing about the other indexes, which might contain the same term in other documents). Thanks, -T
Re: New version/API stable?
0.19 is considered unstable by us at Cloudera and by the Y! folks; they never deployed it to their clusters. That said, we recommend 0.18.3 as the most stable version of Hadoop right now. Y! has (or will soon) deploy(ed) 0.20, which implies that it's at least stable enough for them to give it a go. Cloudera plans to support 0.20 as soon as a few more bugs get flushed out, which will probably happen in its next minor release. So anyway, that said, it might make sense for you to start with 0.20.0, as long as you understand that the first major release usually is pretty buggy, and is basically considered a beta. If you're not willing to take the stability risk, then I'd recommend going with 0.18.3, though the upgrade from 0.18.3 to 0.20.X is going to be a headache (APIs changed, configuration files changed, etc.). Hope this is insightful. Alex On Thu, May 28, 2009 at 2:59 PM, David Rosenstrauch wrote: > Hadoop noob here, just starting to learn it, as we're planning to start > using it heavily in our processing. Just wondering, though, which version > of the code I should start learning/working with. > > It looks like the Hadoop API changed pretty significantly from 0.19 to > 0.20 (e.g., org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce), > which leads me to think I should start with the new API. OTOH, since the > new release is a ".0" release after some of these major API overhauls, I'm > wondering if it's stable enough for us to start using in production. > > Where'd be best for me to start learning? > > TIA, > > DR > >
MultipleOutputs or MultipleTextOutputFormat?
I am trying to figure out the best way to split output into different directories. My goal is to have a directory structure allowing me to add the content from each batch into the right bucket, like this: ... /content/200904/batch_20090429 /content/200904/batch_20090430 /content/200904/batch_20090501 /content/200904/batch_20090502 /content/200905/batch_20090430 /content/200905/batch_20090501 /content/200905/batch_20090502 ... I would then run my nightly jobs to build the index on /content/200904/* for the April index and /content/200905/* for the May index. I'm not sure whether I would be better off using MultipleOutputs or MultipleTextOutputFormat. I'm having trouble understanding how I set the output path for these two classes. It seems like MultipleTextOutputFormat is about partitioning data into different files within the same directory based on the key, rather than into different directories as I need. Could I get the behavior I want by specifying date/batch as my filename, setting the output path to some temporary work directory, and then moving /work/* to /content? MultipleOutputs seems to be more about outputting all the data in different formats, but it's supposed to be simpler to use. Reading up on it, it seems to be better documented and the API makes more sense (choosing the output explicitly in the map or reduce, rather than hiding this decision in the output format), but I don't see any way to set a file name. If I am using TextOutputFormat, I see no way to put these into different directories.
Re: InputFormat for fixed-width records?
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley wrote: > > The update to the terasort example has an InputFormat that does exactly > that. The key is 10 bytes and the value is the next 90 bytes. It is pretty > easy to write, but I should upload it soon. The output types are Text, but > they just have the binary data in them. > Would you mind uploading it or sending it to the list?
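While waiting for the terasort version, a rough sketch of a fixed-width reader on the old mapred API might look like the following. It is untested, assumes the file length is an exact multiple of the record size, and simply aligns each split to the next 100-byte boundary; prefer the real terasort code once it is posted:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedWidthInputFormat extends FileInputFormat<Text, Text> {

  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new FixedWidthRecordReader((FileSplit) split, job);
  }

  static class FixedWidthRecordReader implements RecordReader<Text, Text> {
    private static final int KEY_LEN = 10, VALUE_LEN = 90;
    private static final int RECORD_LEN = KEY_LEN + VALUE_LEN;
    private final FSDataInputStream in;
    private final long start, end;
    private long pos;
    private final byte[] buffer = new byte[RECORD_LEN];

    FixedWidthRecordReader(FileSplit split, JobConf job) throws IOException {
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      in = fs.open(file);
      start = split.getStart();
      end = start + split.getLength();
      // Round up to the next record boundary so splits never overlap.
      pos = ((start + RECORD_LEN - 1) / RECORD_LEN) * RECORD_LEN;
      in.seek(pos);
    }

    public boolean next(Text key, Text value) throws IOException {
      if (pos >= end) {
        return false;
      }
      in.readFully(buffer);                    // one fixed-width record
      key.set(buffer, 0, KEY_LEN);             // first 10 bytes
      value.set(buffer, KEY_LEN, VALUE_LEN);   // remaining 90 bytes
      pos += RECORD_LEN;
      return true;
    }

    public Text createKey() { return new Text(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return pos; }
    public void close() throws IOException { in.close(); }

    public float getProgress() {
      if (end == start) {
        return 1.0f;
      }
      return Math.min(1.0f, (pos - start) / (float) (end - start));
    }
  }
}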
Re: Reduce() time takes ~4x Map()
Hi David, If you go to JobTrackerHistory and then click on this job and then do Analyse This Job, you should be able to get the split up timings for the individual phases of the map and reduce tasks, including the average, best and worst times. Could you provide those numbers so that we can get a better idea of how the job progressed. Jothi On 5/28/09 10:11 PM, "David Batista" wrote: > Hi everyone, > > I'm processing XML files, around 500MB each with several documents, > for the map() function I pass a document from the XML file, which > takes some time to process depending on the size - I'm applying NER to > texts. > > Each document has a unique identifier, so I'm using that identifier as > a key and the results of parsing the document in one string as the > output: > > so at the end of the map function(): > output.collect( new Text(identifier), new Text(outputString)); > > usually the outputString is around 1k-5k size > > reduce(): > public void reduce(Text key, Iterator values, > OutputCollector output, Reporter reporter) { > while (values.hasNext()) { > Text text = values.next(); > try { > output.collect(key, text); > } catch (IOException e) { > // TODO Auto-generated catch block > e.printStackTrace(); > } > } > } > > I did a test using only 1 machine with 8 cores, and only 1 XML file, > it took around 3 hours to process all maps and ~12hours for the > reduces! > > the XML file has 139 945 documents > > I set the jobconf for 1000 maps() and 200 reduces() > > I did took a look at graphs on the web interface during the reduce > phase, and indeed its the copy phase that's taking much of the time, > the sort and reduce phase are done almost instantly. > > Why does the copy phase takes so long? I understand that the copies > are made using HTTP, and the data was in really small chunks 1k-5k > size, but even so, being everything in the same physical machine > should have been faster, no? > > Any suggestions on what might be causing the copies in reduce to take so long? > -- > ./david
Re: Reduce() time takes ~4x Map()
At the minimal level, enable map output compression, it may make some difference, mapred.compress.map.output. Sorting is very expensive when there are many keys and the values are large. Are you quite certain your keys are unique. Also, do you need them sorted by document id? On Thu, May 28, 2009 at 8:51 PM, Jothi Padmanabhan wrote: > Hi David, > > If you go to JobTrackerHistory and then click on this job and then do > Analyse This Job, you should be able to get the split up timings for the > individual phases of the map and reduce tasks, including the average, best > and worst times. Could you provide those numbers so that we can get a > better > idea of how the job progressed. > > Jothi > > > On 5/28/09 10:11 PM, "David Batista" wrote: > > > Hi everyone, > > > > I'm processing XML files, around 500MB each with several documents, > > for the map() function I pass a document from the XML file, which > > takes some time to process depending on the size - I'm applying NER to > > texts. > > > > Each document has a unique identifier, so I'm using that identifier as > > a key and the results of parsing the document in one string as the > > output: > > > > so at the end of the map function(): > > output.collect( new Text(identifier), new Text(outputString)); > > > > usually the outputString is around 1k-5k size > > > > reduce(): > > public void reduce(Text key, Iterator values, > > OutputCollector output, Reporter reporter) { > > while (values.hasNext()) { > > Text text = values.next(); > > try { > > output.collect(key, text); > > } catch (IOException e) { > > // TODO Auto-generated catch block > > e.printStackTrace(); > > } > > } > > } > > > > I did a test using only 1 machine with 8 cores, and only 1 XML file, > > it took around 3 hours to process all maps and ~12hours for the > > reduces! > > > > the XML file has 139 945 documents > > > > I set the jobconf for 1000 maps() and 200 reduces() > > > > I did took a look at graphs on the web interface during the reduce > > phase, and indeed its the copy phase that's taking much of the time, > > the sort and reduce phase are done almost instantly. > > > > Why does the copy phase takes so long? I understand that the copies > > are made using HTTP, and the data was in really small chunks 1k-5k > > size, but even so, being everything in the same physical machine > > should have been faster, no? > > > > Any suggestions on what might be causing the copies in reduce to take so > long? > > -- > > ./david > > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
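For reference, that switch can also be flipped from the job driver with the JobConf API; a small sketch (MyJob is a placeholder for your driver class, and the codec choice is just an example):

JobConf conf = new JobConf(MyJob.class);
// Compress the intermediate map output that reducers fetch over HTTP.
conf.setCompressMapOutput(true); // same as setting mapred.compress.map.output=true
conf.setMapOutputCompressorClass(org.apache.hadoop.io.compress.DefaultCodec.class);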
Re: org.apache.hadoop.ipc.client : trying connect to server failed
Hi, can someone help me out? On Thu, May 28, 2009 at 10:32 PM, ashish pareek wrote: > Hi, I am trying to set up a Hadoop cluster on 512 MB machines using Hadoop 0.18, and I have followed the procedure given on the Apache Hadoop site for a Hadoop cluster. > I included two datanodes in conf/slaves, i.e. the namenode virtual machine and another virtual machine, and I have set up passwordless ssh between both virtual machines. But now the problem is that when I run the command > bin/hadoop start-all.sh > it starts only one datanode, on the same namenode virtual machine, but it doesn't start the datanode on the other machine. > In logs/hadoop-datanode I get the message: > INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 1 time(s). > 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s). > 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s). > ... > So can anyone help in solving this problem? :) > Thanks > Regards > Ashish Pareek
Re: Efficient algorithm for many-to-many reduce-side join?
Use the mapside join stuff, if I understand your problem it provides a good solution but requires getting over the learning hurdle. Well described in chapter 8 of my book :) On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel wrote: > I believe PIG, and I know Cascading use a kind of 'spillable' list that can > be re-iterated across. PIG's version is a bit more sophisticated last I > looked. > > that said, if you were using either one of them, you wouldn't need to write > your own many-to-many join. > > cheers, > ckw > > > On May 28, 2009, at 8:14 AM, Todd Lipcon wrote: > > One last possible trick to consider: >> >> If you were to subclass SequenceFileRecordReader, you'd have access to its >> seek method, allowing you to rewind the reducer input. You could then >> implement a block hash join with something like the following pseudocode: >> >> ahash = new HashMap(); >> while (i have ram available) { >> read a record >> if the record is from table B, break >> put the record into ahash >> } >> nextAPos = reader.getPos() >> >> while (current record is an A record) { >> skip to next record >> } >> firstBPos = reader.getPos() >> >> while (current record has current key) { >> read and join against ahash >> process joined result >> } >> >> if firstBPos > nextAPos { >> seek(nextAPos) >> go back to top >> } >> >> >> On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon wrote: >> >> Hi Stuart, >>> >>> It seems to me like you have a few options. >>> >>> Option 1: Just use a lot of RAM. Unless you really expect many millions >>> of >>> entries on both sides of the join, you might be able to get away with >>> buffering despite its inefficiency. >>> >>> Option 2: Use LocalDirAllocator to find some local storage to spill all >>> of >>> the left table's records to disk in a MapFile format. Then as you iterate >>> over the right table, do lookups in the MapFile. This is really the same >>> as >>> option 1, except that you're using disk as an extension of RAM. >>> >>> Option 3: Convert this to a map-side merge join. Basically what you need >>> to >>> do is sort both tables by the join key, and partition them with the same >>> partitioner into the same number of columns. This way you have an equal >>> number of part-N files for both tables, and within each part-N >>> file >>> they're ordered by join key. In each map task, you open both >>> tableA/part-N >>> and tableB/part-N and do a sequential merge to perform the join. I >>> believe >>> the CompositeInputFormat class helps with this, though I've never used >>> it. >>> >>> Option 4: Perform the join in several passes. Whichever table is smaller, >>> break into pieces that fit in RAM. Unless my relational algebra is off, A >>> JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 >>> UNION >>> B2. >>> >>> Hope that helps >>> -Todd >>> >>> >>> On Thu, May 28, 2009 at 5:02 AM, Stuart White >> >wrote: >>> >>> I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can can multiple records with any given key. Every description of a reduce-side join I've seen involves constructing your keys out of your mapper such that records from one dataset will be presented to the reducers before records from the second dataset. I should "hold on" to the value from the one dataset and remember it as I iterate across the values from the second dataset. This seems like it only works well for one-to-many joins (when one of your datasets will only have a single record with any given key). 
This scales well because you're only remembering one value. In a many-to-many join, if you apply this same algorithm, you'll need to remember all values from one dataset, which of course will be problematic (and won't scale) when dealing with large datasets with large numbers of records with the same keys. Does an efficient algorithm exist for a many-to-many reduce-side join? >>> >>> > -- > Chris K Wensel > ch...@concurrentinc.com > http://www.concurrentinc.com > > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
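For anyone pursuing Todd's option 3, CompositeInputFormat from org.apache.hadoop.mapred.join is wired up roughly like this -- a sketch only, with placeholder paths and driver class, and it requires both inputs to already be sorted by the join key and partitioned identically into the same number of part files:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

JobConf conf = new JobConf(JoinJob.class); // JoinJob is a placeholder driver class
// Build the join expression: an inner join of the two sorted, co-partitioned tables.
String expr = CompositeInputFormat.compose("inner",
    SequenceFileInputFormat.class,
    new Path("/data/tableA"), new Path("/data/tableB"));
conf.set("mapred.join.expr", expr);
conf.setInputFormat(CompositeInputFormat.class);
// Each map task then receives the join key along with a TupleWritable
// holding one value from each table for that key.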
Re: org.apache.hadoop.ipc.client : trying connect to server failed
Make sure you can ping that datanode and ssh into it. On Thu, May 28, 2009 at 12:02 PM, ashish pareek wrote: > Hi, I am trying to set up a Hadoop cluster on 512 MB machines using Hadoop 0.18, and I have followed the procedure given on the Apache Hadoop site for a Hadoop cluster. > I included two datanodes in conf/slaves, i.e. the namenode virtual machine and another virtual machine, and I have set up passwordless ssh between both virtual machines. But now the problem is that when I run the command > bin/hadoop start-all.sh > it starts only one datanode, on the same namenode virtual machine, but it doesn't start the datanode on the other machine. > In logs/hadoop-datanode I get the message: > INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 1 time(s). > 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s). > 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s). > ... > So can anyone help in solving this problem? :) > Thanks > Regards > Ashish Pareek
Re: org.apache.hadoop.ipc.client : trying connect to server failed
Yes, I am able to ping and ssh between the two virtual machines, and I have even set the IP addresses of both virtual machines in their respective /etc/hosts files. Thanks for the reply. Could you suggest anything else I might have missed, or any other remedy? Regards, Ashish Pareek. On Fri, May 29, 2009 at 10:04 AM, Pankil Doshi wrote: > Make sure you can ping that datanode and ssh into it. > On Thu, May 28, 2009 at 12:02 PM, ashish pareek wrote: > > Hi, I am trying to set up a Hadoop cluster on 512 MB machines using Hadoop 0.18, and I have followed the procedure given on the Apache Hadoop site for a Hadoop cluster. > > I included two datanodes in conf/slaves, i.e. the namenode virtual machine and another virtual machine, and I have set up passwordless ssh between both virtual machines. But now the problem is that when I run the command > > bin/hadoop start-all.sh > > it starts only one datanode, on the same namenode virtual machine, but it doesn't start the datanode on the other machine. > > In logs/hadoop-datanode I get the message: > > INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 1 time(s). > > 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s). > > 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s). > > ... > > So can anyone help in solving this problem? :) > > Thanks > > Regards > > Ashish Pareek
RE: SequenceFile and streaming
Hi Tom, I have seen the tar-to-seq tool, but the person who made it says it is very slow: "It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file". To me that is not acceptable. Normally, generating a tar file from 615MB of data takes less than one minute, and in my view generating a sequence file should be even simpler: you just have to append files and headers without worrying about hierarchy. Regarding SequenceFileAsTextInputFormat, I am not sure it will do the job I am looking for. The Hadoop documentation says: SequenceFileAsTextInputFormat generates SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling the toString() method. Let us suppose that the keys and values were generated using tar-to-seq on a tar archive. Each value is a byte array that stores the content of a file, which can be any kind of data (for example a jpeg picture). It doesn't make sense to convert this data into a string. What is needed is a tool to simply extract the file, as with tar -xf archive.tar filename. The Hadoop framework can be used to extract it via a Java class, but you have to do that within a Java program. The streaming package is meant to be used in a unix shell without the need for Java programming, but I think it is not very useful if the sequence file (which is the principal data structure of Hadoop) is not accessible from a shell command. Walter
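As far as I know there is no built-in shell command that extracts a single binary value, but a small standalone extractor is not much code. A sketch, assuming the archive uses Text keys (entry names) and BytesWritable values, which is what tar-to-seq appears to produce:

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Usage: hadoop jar yourjar.jar SeqExtract <sequence file> <entry name>
public class SeqExtract {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path seq = new Path(args[0]);
    FileSystem fs = seq.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, seq, conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    try {
      while (reader.next(key, value)) {
        if (key.toString().equals(args[1])) {
          // Note: entry names containing directories would need those
          // directories created locally first.
          FileOutputStream out = new FileOutputStream(args[1]);
          out.write(value.getBytes(), 0, value.getLength()); // raw file bytes
          out.close();
          return;
        }
      }
      System.err.println(args[1] + " not found in " + args[0]);
    } finally {
      reader.close();
    }
  }
}

From there it would not be hard to wrap the same loop in a command that lists entries or extracts them all, which would give roughly the tar -xf style workflow described above.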
Re: MultipleOutputs or MultipleTextOutputFormat?
One way of doing what you need is to extend MultipleTextOutputFormat and override the following APIs - generateFileNameForKeyValue() - generateActualKey() - generateActualValue() You will need to prefix the directory and file-name of your choice to the key/value depending upon your needs. Assuming key and value types to be Text here is some sample code for reference public String generateFileNameForKeyValue(Text key, Text v, String name) { /* * split the default name (for e.x. part-0 into ['part', '0'] ) */ String[] nameparts = name.split("-"); String keyStr = key.toString(); /** * assuming desired filename is prefixed to the key and separated from the * actual key contents by '\t' */ int idx = keyStr.indexOf("\t"); /* * get the file name */ name = keyStr.substring(0, idx); /** * return the path of the form 'fileName/fileName-' * This makes sure that fileName dir is created under job's output dir * and all the keys with that prefix go into reducer-specific files under * that dir. */ return new Path(name, name + "-" + nameparts[1]).toString(); } public Text generateActualKey(Text key, Text value) { String keyStr = key.toString(); int idx = keyStr.indexOf("\t") + 1; return new Text(keyStr.substring(idx)); } Hope that helps. -Ankur - Original Message - From: "Kevin Peterson" To: core-user@hadoop.apache.org Sent: Friday, May 29, 2009 4:55:22 AM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi Subject: MultipleOutputs or MultipleTextOutputFormat? I am trying to figure out the best way to split output into different directories. My goal is to have a directory structure allowing me to add the content from each batch into the right bucket, like this: ... /content/200904/batch_20090429 /content/200904/batch_20090430 /content/200904/batch_20090501 /content/200904/batch_20090502 /content/200905/batch_20090430 /content/200905/batch_20090501 /content/200905/batch_20090502 ... I would then run my nightly jobs to build the index on /content/200904/* for the April index and /content/200905/* for the May index. I'm not sure whether I would be better off using MultipleOutputs or MultipleTextOutputFormat. I'm having trouble understanding how I set the output path for these two classes. It seems like MultipleTextOutputFormat is about partitioning data to different files within the same directory on the key, rather than into different directories as I need. Could I get the behavior I want by specifying date/batch as my filename, set output path to some temporary work directory, then move /work/* to /content? MultipleOutputs seems to be more about outputting all the data in different formats, but it's supposed to be simpler to use. Reading it, it seems to be better documented and the API makes more sense (choosing the output explicitly in the map or reduce, rather than hiding this decision in the output format), but I don't see any way to set a file name. If am using textoutputformat, I see no way to put these into different directories.
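To complete the picture, the subclass sketched above would be wired into the job driver roughly like this (class and path names are placeholders, not from the original post):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(ContentSplitJob.class); // placeholder driver class
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
// BatchDirOutputFormat is the MultipleTextOutputFormat subclass described above.
conf.setOutputFormat(BatchDirOutputFormat.class);
// Paths returned by generateFileNameForKeyValue(), e.g.
// "200904/batch_20090429/batch_20090429-00000", are created under this directory.
FileOutputFormat.setOutputPath(conf, new Path("/content"));

Whether the key prefix encodes just the batch or the month/batch pair decides how deep the resulting directory tree goes.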