Re: Hadoop streaming - No room for reduce task error
The reduce input must be merged and sorted before the reduce runs, and unless the machine/JVM has a huge allowed memory space, the data will spill to disk during the sort; if the reduce input is expected to be larger than the partition's free space, the task cannot be scheduled there. If I did my math correctly, you are trying to push ~2TB through a single reduce.

As for the part files: if you have the number of reduces set to zero, you will get N part files, where N is the number of map tasks. If you absolutely must have it all go to one reduce, you will need to increase the free disk space. I think 0.19.1 supports compression for the map output, so you could try enabling it. If you have many nodes, you can set the number of reduces to some reasonable number and then use sort -m on the part files to merge-sort them, assuming your reduce preserves ordering.

Try adding these parameters to your job line: -D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK

BTW, /bin/cat works fine as an identity mapper or an identity reducer.

On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon wrote: > Hey Scott, > It turns out that Alex's answer was mistaken - your error is actually > coming > from lack of disk space on the TT that has been assigned the reduce task. > Specifically, there is not enough space in mapred.local.dir. You'll need to > change your mapred.local.dir to point to a partition that has enough space > to contain your reduce output. > > As for why this is the case, I hope someone will pipe up. It seems to me > that reduce output can go directly to the target filesystem without using > space on mapred.local.dir. > > Thanks > -Todd > > On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard > wrote: > > > What is mapred.child.ulimit set to? This configuration option specifies > > how much memory child processes are allowed to have. You may want to up > > this limit and see what happens. > > > > Let me know if that doesn't get you anywhere. 
> > > > Alex > > > > On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote: > > > > > Complete newby map/reduce question here. I am using hadoop streaming > as > > I > > > come from a Perl background, and am trying to prototype/test a process > to > > > load/clean-up ad server log lines from multiple input files into one > > large > > > file on the hdfs that can then be used as the source of a hive db > table. > > > I have a perl map script that reads an input line from stdin, does the > > > needed cleanup/manipulation, and writes back to stdout.I don't > really > > > need a reduce step, as I don't care what order the lines are written > in, > > and > > > there is no summary data to produce. When I run the job with -reducer > > NONE > > > I get valid output, however I get multiple part-x files rather than > > one > > > big file. > > > So I wrote a trivial 'reduce' script that reads from stdin and simply > > > splits the key/value, and writes the value back to stdout. > > > > > > I am executing the code as follows: > > > > > > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper > > > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer > > > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input > > /logs/*.log > > > -output test9 > > > > > > The code I have works when given a small set of input files. However, > I > > > get the following error when attempting to run the code on a large set > of > > > input files: > > > > > > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 > > 15:43:00,905 > > > WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. > > Node > > > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has > 2004049920 > > > bytes free; but we expect reduce input to take 22138478392 > > > > > > I assume this is because the all the map output is being buffered in > > memory > > > prior to running the reduce step? If so, what can I change to stop the > > > buffering? 
I just need the map output to go directly to one large > file. > > > > > > Thanks, > > > Scott > > > > > > > > > -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
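Putting jason's suggested flags onto Scott's job line would look roughly like the sketch below. This is untested and requires a running cluster; note that the generic -D options must come before the streaming-specific options, and /bin/cat stands in as the identity reducer as suggested above:

```
hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
  -D mapred.compress.map.output=true \
  -D mapred.output.compression.type=BLOCK \
  -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
  -reducer /bin/cat \
  -input /logs/*.log \
  -output test9

# If multiple reduces are used instead, the part files can later be
# merge-sorted locally (assuming each reduce preserves ordering):
#   sort -m part-* > merged.txt
```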
RE: Large size Text file split
I don't understand the internal logic of FileSplit and the Mapper. By my understanding, FileInputFormat is the class that actually takes care of the file splitting, so it's reasonable that one large file is split into 5 smaller parts, each less than 2GB (since we specify numberOfSplit as 5). However, the FileSplit boundaries are rough edges, so mapper 1, which takes split 1 as input, omits the incomplete record at the end of split 1, and mapper 2 then continues reading that incomplete part and joins it with the remainder in split 2?

Take this as an example. The original file is:

1::122::5::838985046 (CRLF)
1::185::5::838983525 (CRLF)
1::231::5::838983392 (CRLF)

Assume the number of splits is 2; then the above content is divided into two parts:

Split 1: 1::122::5::838985046 (CRLF) 1::185::5::8
Split 2: 38983525 (CRLF) 1::231::5::838983392 (CRLF)

Afterwards, Mapper 1 takes split 1 as input, but after consuming the line 1::122::5::838985046 it finds that the remaining part is not a complete record, so Mapper 1 bypasses it, while Mapper 2 reads it and prepends it to the first line of split 2 to compose a valid record. Is that correct? If it is, which class implements this logic?

BR/anderson

-Original Message- From: Aaron Kimball [mailto:aa...@cloudera.com] Sent: Thursday, June 11, 2009 11:49 AM To: core-user@hadoop.apache.org Subject: Re: Large size Text file split

The FileSplit boundaries are "rough" edges -- the mapper responsible for the previous split will continue until it finds a full record, and the next mapper will read ahead and only start on the first record boundary after the byte offset. - Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo wrote: > I think the default TextInputFormat can meet my requirement. However, > even if the JavaDoc of TextInputFormat says the TextInputFormat could > divide input file as text lines which ends with CRLF. But I'd like to > know if the FileSplit size is not N times of line length, what will be > happen eventually? 
> > BR/anderson > > -Original Message- > From: jason hadoop [mailto:jason.had...@gmail.com] > Sent: Wednesday, June 10, 2009 8:39 PM > To: core-user@hadoop.apache.org > Subject: Re: Large size Text file split > > There is always NLineInputFormat. You specify the number of lines per > split. > The key is the position of the line start in the file, value is the > line itself. > The parameter mapred.line.input.format.linespermap controls the number > of lines per split > > On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi < > harish.mallipe...@gmail.com> wrote: > > > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo > > > > wrote: > > > > > Hi, all > > > > > > I have a large csv file ( larger than 10 GB ), I'd like to use a > > > certain InputFormat to split it into smaller part thus each Mapper > > > can deal with piece of the csv file. However, as far as I know, > > > FileInputFormat only cares about byte size of file, that is, the > > > class can divide the csv file as many part, and maybe some part is > not a well-format CVS file. > > > For example, one line of the CSV file is not terminated with CRLF, > > > or maybe some text is trimed. > > > > > > How to ensure each FileSplit is a smaller valid CSV file using a > > > proper InputFormat? > > > > > > BR/anderson > > > > > > > If all you care about is the splits occurring at line boundaries, > > then > > > TextInputFormat will work. > > > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/map > > re > > d/TextInputFormat.html > > > > If not I guess you can write your own InputFormat class. > > > > -- > > Harish Mallipeddi > > http://blog.poundbang.in > > > > > > -- > Pro Hadoop, a book to guide you from beginner to hadoop mastery, > http://www.apress.com/book/view/9781430219422 > www.prohadoopbook.com a community for Hadoop Professionals >
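Aaron's rule (the mapper owning the previous split finishes the record it started; the next mapper skips the partial record at its start) is implemented by LineRecordReader, which TextInputFormat uses. A simplified pure-Java sketch of the boundary logic, using the example records above — the real class also handles CR/LF pairs, compression, and buffered I/O:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLines {
    // Return the lines that "belong" to the split [start, start+len):
    // a split that does not begin at offset 0 first skips forward to the
    // character after the next '\n' (the partial record belongs to the
    // previous split), and every split keeps reading past its own end
    // until it finishes the record it started.
    public static List<String> linesForSplit(String data, int start, int len) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            int nl = data.indexOf('\n', start);
            if (nl < 0) return lines;      // no record starts in this split
            pos = nl + 1;                  // skip the partial record
        }
        int end = start + len;
        while (pos < data.length() && pos < end) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) nl = data.length();
            lines.add(data.substring(pos, nl));
            pos = nl + 1;                  // may run past 'end' to finish the last record
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "1::122::5::838985046\n1::185::5::838983525\n1::231::5::838983392\n";
        int mid = data.length() / 2;       // cuts the second record in half
        System.out.println(linesForSplit(data, 0, mid));
        // -> [1::122::5::838985046, 1::185::5::838983525]
        System.out.println(linesForSplit(data, mid, data.length() - mid));
        // -> [1::231::5::838983392]
    }
}
```

Split 1 ends mid-record but still emits the whole second line; split 2 skips the half-record it begins with, so every record is read exactly once.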
Re: Large size Text file split
The FileSplit boundaries are "rough" edges -- the mapper responsible for the previous split will continue until it finds a full record, and the next mapper will read ahead and only start on the first record boundary after the byte offset. - Aaron On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo wrote: > I think the default TextInputFormat can meet my requirement. However, > even if the JavaDoc of TextInputFormat says the TextInputFormat could > divide input file as text lines which ends with CRLF. But I'd like to > know if the FileSplit size is not N times of line length, what will be > happen eventually? > > BR/anderson > > -Original Message- > From: jason hadoop [mailto:jason.had...@gmail.com] > Sent: Wednesday, June 10, 2009 8:39 PM > To: core-user@hadoop.apache.org > Subject: Re: Large size Text file split > > There is always NLineInputFormat. You specify the number of lines per > split. > The key is the position of the line start in the file, value is the line > itself. > The parameter mapred.line.input.format.linespermap controls the number > of lines per split > > On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi < > harish.mallipe...@gmail.com> wrote: > > > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo > > wrote: > > > > > Hi, all > > > > > > I have a large csv file ( larger than 10 GB ), I'd like to use a > > > certain InputFormat to split it into smaller part thus each Mapper > > > can deal with piece of the csv file. However, as far as I know, > > > FileInputFormat only cares about byte size of file, that is, the > > > class can divide the csv file as many part, and maybe some part is > not a well-format CVS file. > > > For example, one line of the CSV file is not terminated with CRLF, > > > or maybe some text is trimed. > > > > > > How to ensure each FileSplit is a smaller valid CSV file using a > > > proper InputFormat? 
> > > > > > BR/anderson > > > > > > > If all you care about is the splits occurring at line boundaries, then > > > TextInputFormat will work. > > > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapre > > d/TextInputFormat.html > > > > If not I guess you can write your own InputFormat class. > > > > -- > > Harish Mallipeddi > > http://blog.poundbang.in > > > > > > -- > Pro Hadoop, a book to guide you from beginner to hadoop mastery, > http://www.apress.com/book/view/9781430219422 > www.prohadoopbook.com a community for Hadoop Professionals >
Re: Multiple NIC Cards
So it turns out the reason I was getting duey.local. was that it was what the nameserver's reverse DNS returned from a previous test. That is fixed, and now the machine reports duey.local.xxx.com. The only remaining issue is the trailing "." (period) that DNS requires to make the name fully qualified. I am not sure whether this is a bug in how Hadoop uses this information or some other issue. If anybody has run across this issue before, any help would be greatly appreciated. Thank you, -John

On Jun 10, 2009, at 9:21 PM, Matt Massie wrote: If you look at the documentation for the getCanonicalHostName() function (thanks, Steve)... http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html#getCanonicalHostName() you'll see two Java security properties (networkaddress.cache.ttl and networkaddress.cache.negative.ttl). You might take a look at your /etc/nsswitch.conf configuration as well to learn how hosts are resolved on your machine, e.g... $ grep hosts /etc/nsswitch.conf hosts: files dns and lastly, you may want to check if you are running nscd (the NameService cache daemon). If you are, take a look at /etc/nscd.conf for the caching policy it's using. Good luck. -Matt On Jun 10, 2009, at 1:09 PM, John Martyniak wrote: That is what I thought also, is that it needs to keep that information somewhere, because it needs to be able to communicate with all of the servers. So I deleted the /tmp/had* and /tmp/hs* directories, removed the log files, and grepped for the duey name in all files in config. And the problem still exists. Originally I thought that it might have had something to do with multiple entries in the .ssh/authorized_keys file but removed everything there. And the problem still existed. So I think that I am going to grab a new install of hadoop 0.19.1, delete the existing one and start out fresh to see if that changes anything. 
Wish me luck:) -John On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote: John Martyniak wrote: Does hadoop "cache" the server names anywhere? Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values. in SVN head, we try and get Java to tell us what is going on http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java This uses InetAddress.getLocalHost().getCanonicalHostName() to get the value, which is cached for life of the process. I don't know of anything else, but wouldn't be surprised -the Namenode has to remember the machines where stuff was stored. John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com
Re: Multiple NIC Cards
Matt, Thanks for the suggestion. I had actually forgotten about local DNS caching. I am on a Mac, so I used dscacheutil -flushcache to clear the cache, and also investigated the ordering. Everything seems to be in order, yet I still get a bogus result: it is using the old name, except with a trailing period. So it is using duey.local. when it should be using duey.local.x.com (which is the internal name). -John

On Jun 10, 2009, at 9:21 PM, Matt Massie wrote: If you look at the documentation for the getCanonicalHostName() function (thanks, Steve)... http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html#getCanonicalHostName() you'll see two Java security properties (networkaddress.cache.ttl and networkaddress.cache.negative.ttl). You might take a look at your /etc/nsswitch.conf configuration as well to learn how hosts are resolved on your machine, e.g... $ grep hosts /etc/nsswitch.conf hosts: files dns and lastly, you may want to check if you are running nscd (the NameService cache daemon). If you are, take a look at /etc/nscd.conf for the caching policy it's using. Good luck. -Matt On Jun 10, 2009, at 1:09 PM, John Martyniak wrote: That is what I thought also, is that it needs to keep that information somewhere, because it needs to be able to communicate with all of the servers. So I deleted the /tmp/had* and /tmp/hs* directories, removed the log files, and grepped for the duey name in all files in config. And the problem still exists. Originally I thought that it might have had something to do with multiple entries in the .ssh/authorized_keys file but removed everything there. And the problem still existed. So I think that I am going to grab a new install of hadoop 0.19.1, delete the existing one and start out fresh to see if that changes anything. Wish me luck:) -John On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote: John Martyniak wrote: Does hadoop "cache" the server names anywhere? 
Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values. in SVN head, we try and get Java to tell us what is going on http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java This uses InetAddress.getLocalHost().getCanonicalHostName() to get the value, which is cached for life of the process. I don't know of anything else, but wouldn't be surprised -the Namenode has to remember the machines where stuff was stored. John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com
Re: code overview document
Cloudera's long-form training videos about _using_ Hadoop are excellent -- here's the intro video, and you can jump to other topics using the links in the sidebar: http://www.cloudera.com/hadoop-training-thinking-at-scale Not sure what starting points to recommend for modifying Hadoop itself, though. Zak On Wed, Jun 10, 2009 at 11:06 PM, bharath vissapragada wrote: > As far as my knowledge goes , there are no good tutorials for coding on > hadoop.. Yahoo provides a tutorial which is fa better than others > http://developer.yahoo.com/hadoop/tutorial/ . If anyone knows any good > mapreduce programming tutorials ..please let us know > Thanks! > > On Thu, Jun 11, 2009 at 8:30 AM, Anurag Gujral wrote: > >> Hi All, >> Is there a detailed documentation on hadoop code.? >> Thanks >> Anurag >> >
2009 Hadoop Summit West - was wonderful
I had a great time schmoozing with people, and enjoyed a couple of the talks. I would love to see more from Priya Narasimhan; I hope their toolset for automated fault detection in Hadoop clusters becomes generally available. Zookeeper rocks on! HBase is starting to look really good: in 0.20 the master node as a single point of failure and configuration headache goes away, and Zookeeper takes over. Owen O'Malley gave a solid presentation on the new Hadoop APIs and the reasons for the changes. It was good to hang with everyone, see you all next year! I even got to spend a little time chatting with Tom White, and got a signed copy of his book, thanks Tom! -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: code overview document
As far as my knowledge goes, there are no good tutorials for coding on Hadoop itself. Yahoo provides a tutorial which is far better than the others: http://developer.yahoo.com/hadoop/tutorial/ . If anyone knows any good MapReduce programming tutorials, please let us know. Thanks! On Thu, Jun 11, 2009 at 8:30 AM, Anurag Gujral wrote: > Hi All, > Is there a detailed documentation on hadoop code.? > Thanks > Anurag >
code overview document
Hi All, Is there detailed documentation on the Hadoop code? Thanks, Anurag
RE: Large size Text file split
I think the default TextInputFormat can meet my requirement. The JavaDoc of TextInputFormat says it divides the input file into text lines ending with CRLF, but I'd like to know: if the FileSplit size is not an exact multiple of the line length, what will happen eventually?

BR/anderson

-Original Message- From: jason hadoop [mailto:jason.had...@gmail.com] Sent: Wednesday, June 10, 2009 8:39 PM To: core-user@hadoop.apache.org Subject: Re: Large size Text file split

There is always NLineInputFormat. You specify the number of lines per split. The key is the position of the line start in the file, value is the line itself. The parameter mapred.line.input.format.linespermap controls the number of lines per split

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo > wrote: > > > Hi, all > > > > I have a large csv file ( larger than 10 GB ), I'd like to use a > > certain InputFormat to split it into smaller part thus each Mapper > > can deal with piece of the csv file. However, as far as I know, > > FileInputFormat only cares about byte size of file, that is, the > > class can divide the csv file as many part, and maybe some part is not a well-format CVS file. > > For example, one line of the CSV file is not terminated with CRLF, > > or maybe some text is trimed. > > > > How to ensure each FileSplit is a smaller valid CSV file using a > > proper InputFormat? > > > > BR/anderson > > > > If all you care about is the splits occurring at line boundaries, then > TextInputFormat will work. > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapre > d/TextInputFormat.html > > If not I guess you can write your own InputFormat class. > > -- > Harish Mallipeddi > http://blog.poundbang.in > -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
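If NLineInputFormat is the route taken, the property jason mentions can be set in the job configuration. A sketch of the hadoop-site.xml (or per-job conf) entry — the value 10000 is an arbitrary example, not a recommendation:

```xml
<property>
  <name>mapred.line.input.format.linespermap</name>
  <value>10000</value>
</property>
```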
Re: Multiple NIC Cards
If you look at the documentation for the getCanonicalHostName() function (thanks, Steve)... http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html#getCanonicalHostName() you'll see two Java security properties (networkaddress.cache.ttl and networkaddress.cache.negative.ttl). You might take a look at your /etc/nsswitch.conf configuration as well to learn how hosts are resolved on your machine, e.g... $ grep hosts /etc/nsswitch.conf hosts: files dns and lastly, you may want to check if you are running nscd (the NameService cache daemon). If you are, take a look at /etc/nscd.conf for the caching policy it's using. Good luck. -Matt On Jun 10, 2009, at 1:09 PM, John Martyniak wrote: That is what I thought also, is that it needs to keep that information somewhere, because it needs to be able to communicate with all of the servers. So I deleted the /tmp/had* and /tmp/hs* directories, removed the log files, and grepped for the duey name in all files in config. And the problem still exists. Originally I thought that it might have had something to do with multiple entries in the .ssh/authorized_keys file but removed everything there. And the problem still existed. So I think that I am going to grab a new install of hadoop 0.19.1, delete the existing one and start out fresh to see if that changes anything. Wish me luck:) -John On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote: John Martyniak wrote: Does hadoop "cache" the server names anywhere? Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values. in SVN head, we try and get Java to tell us what is going on http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java This uses InetAddress.getLocalHost().getCanonicalHostName() to get the value, which is cached for life of the process. 
I don't know of anything else, but wouldn't be surprised -the Namenode has to remember the machines where stuff was stored. John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com
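The two security properties Matt mentions can be set programmatically before the first lookup, so a changed DNS entry is picked up without restarting the JVM. A minimal sketch (the properties must be set before any InetAddress resolution happens in the process; the class name is illustrative):

```java
import java.security.Security;

public class DnsCacheSettings {
    public static void main(String[] args) {
        // "0" disables caching entirely; a small positive value (seconds)
        // is usually safer in production than no cache at all.
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
        System.out.println("ttl=" + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

These same properties can also be set in the JRE's java.security file, which avoids touching application code.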
Re: Hadoop streaming - No room for reduce task error
Hey Scott, It turns out that Alex's answer was mistaken - your error is actually coming from lack of disk space on the TT that has been assigned the reduce task. Specifically, there is not enough space in mapred.local.dir. You'll need to change your mapred.local.dir to point to a partition that has enough space to contain your reduce output. As for why this is the case, I hope someone will pipe up. It seems to me that reduce output can go directly to the target filesystem without using space on mapred.local.dir. Thanks -Todd On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard wrote: > What is mapred.child.ulimit set to? This configuration options specifics > how much memory child processes are allowed to have. You may want to up > this limit and see what happens. > > Let me know if that doesn't get you anywhere. > > Alex > > On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote: > > > Complete newby map/reduce question here. I am using hadoop streaming as > I > > come from a Perl background, and am trying to prototype/test a process to > > load/clean-up ad server log lines from multiple input files into one > large > > file on the hdfs that can then be used as the source of a hive db table. > > I have a perl map script that reads an input line from stdin, does the > > needed cleanup/manipulation, and writes back to stdout.I don't really > > need a reduce step, as I don't care what order the lines are written in, > and > > there is no summary data to produce. When I run the job with -reducer > NONE > > I get valid output, however I get multiple part-x files rather than > one > > big file. > > So I wrote a trivial 'reduce' script that reads from stdin and simply > > splits the key/value, and writes the value back to stdout. 
> > > > I am executing the code as follows: > > > > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper > > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer > > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input > /logs/*.log > > -output test9 > > > > The code I have works when given a small set of input files. However, I > > get the following error when attempting to run the code on a large set of > > input files: > > > > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 > 15:43:00,905 > > WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. > Node > > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 > > bytes free; but we expect reduce input to take 22138478392 > > > > I assume this is because the all the map output is being buffered in > memory > > prior to running the reduce step? If so, what can I change to stop the > > buffering? I just need the map output to go directly to one large file. > > > > Thanks, > > Scott > > > > >
Re: Hadoop streaming - No room for reduce task error
What is mapred.child.ulimit set to? This configuration option specifies how much memory child processes are allowed to have. You may want to up this limit and see what happens. Let me know if that doesn't get you anywhere. Alex On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote: > Complete newbie map/reduce question here. I am using hadoop streaming as I > come from a Perl background, and am trying to prototype/test a process to > load/clean-up ad server log lines from multiple input files into one large > file on the hdfs that can then be used as the source of a hive db table. > I have a perl map script that reads an input line from stdin, does the > needed cleanup/manipulation, and writes back to stdout. I don't really > need a reduce step, as I don't care what order the lines are written in, and > there is no summary data to produce. When I run the job with -reducer NONE > I get valid output, however I get multiple part-x files rather than one > big file. > So I wrote a trivial 'reduce' script that reads from stdin and simply > splits the key/value, and writes the value back to stdout. > > I am executing the code as follows: > > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input /logs/*.log > -output test9 > > The code I have works when given a small set of input files. However, I > get the following error when attempting to run the code on a large set of > input files: > > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 15:43:00,905 > WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 > bytes free; but we expect reduce input to take 22138478392 > > I assume this is because all the map output is being buffered in memory > prior to running the reduce step? If so, what can I change to stop the > buffering? 
I just need the map output to go directly to one large file. > > Thanks, > Scott > >
Re: maybe a bug in hadoop?
Hi, > This seems to already be addressed by > https://issues.apache.org/jira/browse/HADOOP-2366 Thanks for the pointer, I just submitted a trivial patch for it. Best, Michele
[ANNOUNCE] hamake-1.1
I would like to announce maintenance release 1.1 of hamake (http://code.google.com/p/hamake/). It mostly includes bug fixes, optimizations and code cleanup. There were minor syntax changes in the hamakefile format, which is now documented here: http://code.google.com/p/hamake/wiki/HaMakefileSyntax This release also includes RPMs in addition to the source tarball. The 'hamake' utility allows you to automate incremental processing of datasets stored on HDFS using Hadoop tasks written in Java or PigLatin scripts. Datasets can be either individual files or directories containing groups of files. New files may be added (or removed) at arbitrary locations, which may trigger recalculation of data depending on them. It is similar to the Unix 'make' utility. Sincerely, Vadim Zaliva
Re: Indexing on top of Hadoop
Hi, you might find some code in katta.sourceforge.net very helpful. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Jun 10, 2009, at 5:49 AM, kartik saxena wrote: Hi, I have a huge LDIF file on the order of GBs spanning some million user records. I am running the example "Grep" job on that file. The search results have not really been up to expectations because it is a basic per-line, brute-force search. I was thinking of building some indexes inside HDFS for that file so that the search results could improve. What could I possibly try to achieve this? Secura
Re: Multiple NIC Cards
That is what I thought also, is that it needs to keep that information somewhere, because it needs to be able to communicate with all of the servers. So I deleted the /tmp/had* and /tmp/hs* directories, removed the log files, and grepped for the duey name in all files in config. And the problem still exists. Originally I thought that it might have had something to do with multiple entries in the .ssh/authorized_keys file but removed everything there. And the problem still existed. So I think that I am going to grab a new install of hadoop 0.19.1, delete the existing one and start out fresh to see if that changes anything. Wish me luck:) -John On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote: John Martyniak wrote: Does hadoop "cache" the server names anywhere? Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values. in SVN head, we try and get Java to tell us what is going on http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java This uses InetAddress.getLocalHost().getCanonicalHostName() to get the value, which is cached for life of the process. I don't know of anything else, but wouldn't be surprised -the Namenode has to remember the machines where stuff was stored. John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com
Re: Hadoop benchmarking
Take a look at Arun's slide deck on Hadoop performance: http://bit.ly/EDCg3 It is important to get io.sort.mb large enough, and io.sort.factor should be closer to 100 than the default 10. I'd also use large block sizes to reduce the number of maps. Please see the deck for other important factors. -- Owen
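For reference, the knobs Owen mentions live in the job/site configuration. The values below are illustrative placeholders only, not tuned recommendations; see Arun's deck for guidance on sizing them to your hardware:

```xml
<!-- Example values only -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>   <!-- map-side sort buffer, in MB -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>   <!-- streams merged at once during sorts -->
</property>
<property>
  <name>dfs.block.size</name>
  <value>268435456</value>   <!-- 256 MB blocks => fewer map tasks -->
</property>
```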
Where in the WebUI do we see setStatus and stderr output?
Hi, I am new to Hadoop and am using Pipes with Hadoop version 0.20.0. Can someone tell me where in the web UI we can see status messages set by TaskContext::setStatus, and the stderr output? Also, is stdout captured somewhere? Thanks in advance, Roshan
Re: which Java version for hadoop-0.19.1 ?
Thanks again, Stuart. I definitely need to search better ... Best Roldano On Wed, Jun 10, 2009 at 07:06:58PM +0200, Stuart White wrote: > http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html#Required+Software > > > On Wed, Jun 10, 2009 at 12:02 PM, Roldano Cattoni wrote: > > > It works, many thanks. > > > > Last question: is this information documented somewhere in the package? I > > was not able to find it. > > > > > > Roldano > > > > > > > > On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote: > > > Java 1.6. > > > > > > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni > > wrote: > > > > > > > A very basic question: which Java version is required for > > hadoop-0.19.1? > > > > > > > > With jre1.5.0_06 I get the error: > > > > java.lang.UnsupportedClassVersionError: Bad version number in .class > > file > > > > at java.lang.ClassLoader.defineClass1(Native Method) > > > > (..) > > > > > > > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 > > > > > > > > > > > > Thanks in advance for your kind help > > > > > > > > Roldano > > > > > >
Re: which Java version for hadoop-0.19.1 ?
http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html#Required+Software On Wed, Jun 10, 2009 at 12:02 PM, Roldano Cattoni wrote: > It works, many thanks. > > Last question: is this information documented somewhere in the package? I > was not able to find it. > > > Roldano > > > > On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote: > > Java 1.6. > > > > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni > wrote: > > > > > A very basic question: which Java version is required for > hadoop-0.19.1? > > > > > > With jre1.5.0_06 I get the error: > > > java.lang.UnsupportedClassVersionError: Bad version number in .class > file > > > at java.lang.ClassLoader.defineClass1(Native Method) > > > (..) > > > > > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 > > > > > > > > > Thanks in advance for your kind help > > > > > > Roldano > > > >
Re: which Java version for hadoop-0.19.1 ?
It works, many thanks. Last question: is this information documented somewhere in the package? I was not able to find it. Roldano On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote: > Java 1.6. > > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni wrote: > > > A very basic question: which Java version is required for hadoop-0.19.1? > > > > With jre1.5.0_06 I get the error: > > java.lang.UnsupportedClassVersionError: Bad version number in .class file > > at java.lang.ClassLoader.defineClass1(Native Method) > > (..) > > > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 > > > > > > Thanks in advance for your kind help > > > > Roldano > >
Re: Indexing on top of Hadoop
We have built basic index support in CloudBase (a data warehouse on top of Hadoop: http://cloudbase.sourceforge.net/) and can share our experience here.

The index we built is like a hash index: for a given column/field value, it tries to process only those data blocks which contain that value while ignoring the rest of the blocks. As you would know, Hadoop stores data in the form of data blocks (64MB default size). During the index creation stage (a MapReduce job), we store the distinct column/field values along with their block information (node, filename, offset etc) in Hadoop's MapFile. A MapFile is a SequenceFile (which stores key/value pairs) that is sorted on keys. So you can see, it is like building an inverted index, where the keys are the column/field values and the posting lists are the lists containing block information. A MapFile is not a very efficient persistent map, so you could store this inverted index in the local file system using something like a Lucene index, but as your index grows you are consuming local space on the Master node. Hence we decided to store it in HDFS using a MapFile.

When you query using the column/field that was indexed, first this inverted index (MapFile) is consulted to find all the blocks that contain the desired value, and InputSplits are created. As you would know, a Mapper works on an InputSplit, and in normal cases Hadoop will schedule tasks that work on all input splits; in this case, we have written our own InputFormat that will schedule tasks only on the required input splits. We have measured the performance of this approach; in some cases we got a 97% improvement (we indexed on a date field and ran queries on a year's log files while fetching only one or a few particular day(s) of data).
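The block-level "hash index" described above can be illustrated with a toy sketch (Python, hypothetical names; CloudBase's real implementation stores the index in a Hadoop MapFile and drives a custom InputFormat with it):

```python
# Toy illustration of the block-level hash index idea: map each distinct
# field value to the list of data blocks containing it, then consult the
# index at query time to pick only the needed blocks. Names are
# hypothetical, not CloudBase's actual code.

from collections import defaultdict

def build_block_index(blocks):
    """blocks: list of (block_id, records); returns value -> sorted block ids."""
    index = defaultdict(set)
    for block_id, records in blocks:
        for record in records:
            index[record["date"]].add(block_id)
    return {value: sorted(ids) for value, ids in index.items()}

def blocks_for_query(index, value):
    """Only these blocks need a mapper; all other blocks are skipped."""
    return index.get(value, [])

blocks = [
    (0, [{"date": "2009-06-09"}, {"date": "2009-06-10"}]),
    (1, [{"date": "2009-06-10"}]),
    (2, [{"date": "2009-06-11"}]),
]
index = build_block_index(blocks)
print(blocks_for_query(index, "2009-06-10"))  # -> [0, 1]
```

In the real system the "posting list" carries node/filename/offset so the InputFormat can build locality-aware InputSplits from it.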
Now some caveats:

1) If the column/field that you are indexing contains a large number of distinct values, then your inverted index (the MapFile) will bloat up, and this file is looked up before any MapReduce job is started, which means this code runs on the master node. This can become very slow. For example, if you have query logs and you index the query column/field, then the size of the MapFile can become really huge. Using a Lucene index on the local file system can speed up the lookup.

2) It does not work on range queries like column < 10 etc. To solve this, we also store the min and max value of the column/field found in a block. That is, for each block, the min and max value of the column/field is stored, and when we encounter a range query, this list is scanned to eliminate some blocks.

Another indexing approach we have thought of: to solve problem 1), where the size of the inverted index (MapFile) becomes huge, we can use an approach called BloomIndexing. This approach makes use of Bloom filters (space-efficient probabilistic data structures used to test whether an element is a member of a set). For each data block, a bloom filter is constructed. For 64MB, 128MB, 256MB data block sizes, the bloom filter will be extremely small. During the index creation stage (a Map job, no reducer), a mapper reads the block/input split, creates a bloom filter for the column/field on which you want to create your index, and stores it in HDFS. During the query phase, before processing an input split/data block, the corresponding bloom filter is first consulted to see if the data block contains the desired value or not. As you would know, a bloom filter will never give a false negative, so if the bloom test fails, you can safely ignore processing of that block. This will save you the time that you would have spent processing each row/line in the data block.
Advantages of this approach: as compared to HashIndexing, this approach scatters the block selection logic across the MapReduce job, so the master node is not overloaded scanning a huge inverted index.

Disadvantages of this approach: you still have to schedule as many mappers as the number of input splits/data blocks, and starting JVMs incurs overhead; however, since hadoop-0.19 you can use the "reuse jvm" flag to avoid some of it. Further, you can increase your block size to 128 or 256MB, which will give you a considerable performance improvement.

Hope this helps, Tarandeep

On Wed, Jun 10, 2009 at 5:49 AM, kartik saxena wrote: > Hi, > > I have a huge LDIF file in order of GBs spanning some million user > records. > I am running the example "Grep" job on that file. The search results have > not really been > up to expectations because of it being a basic per-line brute force. > > I was thinking of building some indexes inside HDFS for that file, so that > the search results could improve. What could I possibly try to achieve > this? > > > Secura >
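The per-block BloomIndexing idea can be sketched as follows (a toy Python illustration, not CloudBase code; a real deployment would size the filter for the block's cardinality and persist it in HDFS):

```python
# Minimal per-block Bloom filter sketch: one small filter per data block;
# a failed membership test means the block certainly does not contain the
# value and its mapper work can be skipped.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # big int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False -> definitely absent (no false negatives); True -> maybe present
        return all(self.bits & (1 << pos) for pos in self._positions(item))

# Index creation (a map-only job in the real system): one filter per block.
block_filters = {}
for block_id, values in [(0, ["a", "b"]), (1, ["c"])]:
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    block_filters[block_id] = bf

# Query phase: only schedule work on blocks whose filter says "maybe".
candidates = [b for b, bf in block_filters.items() if bf.might_contain("c")]
```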
Hadoop streaming - No room for reduce task error
Complete newbie map/reduce question here. I am using hadoop streaming as I come from a Perl background, and am trying to prototype/test a process to load/clean-up ad server log lines from multiple input files into one large file on the hdfs that can then be used as the source of a hive db table. I have a perl map script that reads an input line from stdin, does the needed cleanup/manipulation, and writes back to stdout. I don't really need a reduce step, as I don't care what order the lines are written in, and there is no summary data to produce. When I run the job with -reducer NONE I get valid output, however I get multiple part-x files rather than one big file. So I wrote a trivial 'reduce' script that reads from stdin and simply splits the key/value, and writes the value back to stdout.

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input /logs/*.log -output test9

The code I have works when given a small set of input files. However, I get the following error when attempting to run the code on a large set of input files:

hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered in memory prior to running the reduce step? If so, what can I change to stop the buffering? I just need the map output to go directly to one large file. Thanks, Scott
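For reference, the trivial 'reduce' script Scott describes only needs to strip the key that streaming prepends to each line. A sketch in Python (Scott's version is in Perl; tab-separated key/value lines are streaming's default convention):

```python
# Sketch of an identity reducer for hadoop streaming: the framework feeds
# the reducer "key<TAB>value" lines on stdin; we drop the key and emit
# only the value. (/bin/cat also works as an identity mapper/reducer, as
# noted elsewhere in this thread.)

def identity_reduce(lines):
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        # If there was no tab, the whole line is the key; emit it as-is.
        yield value if sep else key

# In the actual streaming job this would be driven by stdin:
#   import sys
#   for v in identity_reduce(sys.stdin): print(v)
print(list(identity_reduce(["k1\tline one\n", "k2\tline two\n"])))
```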
Re: which Java version for hadoop-0.19.1 ?
Java 1.6. On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni wrote: > A very basic question: which Java version is required for hadoop-0.19.1? > > With jre1.5.0_06 I get the error: > java.lang.UnsupportedClassVersionError: Bad version number in .class file > at java.lang.ClassLoader.defineClass1(Native Method) > (..) > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 > > > Thanks in advance for your kind help > > Roldano >
which Java version for hadoop-0.19.1 ?
A very basic question: which Java version is required for hadoop-0.19.1? With jre1.5.0_06 I get the error: java.lang.UnsupportedClassVersionError: Bad version number in .class file at java.lang.ClassLoader.defineClass1(Native Method) (..) By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 Thanks in advance for your kind help Roldano
Re: Multiple NIC Cards
John Martyniak wrote: Does hadoop "cache" the server names anywhere? Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values. In SVN head, we try and get Java to tell us what is going on: http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java This uses InetAddress.getLocalHost().getCanonicalHostName() to get the value, which is cached for the life of the process. I don't know of anything else, but wouldn't be surprised - the Namenode has to remember the machines where stuff was stored.
Re: Hadoop benchmarking
Hi Stephen, That will set the maximum heap allowable, but doesn't necessarily tell Hadoop's internal systems to take advantage of it. There's a number of other settings that adjust performance. At Cloudera we have a config tool that generates Hadoop configurations with reasonable first-approximation values for your cluster -- check out http://my.cloudera.com and look at the hadoop-site.xml it generates. If you start from there you might find a better parameter space to explore. Please share back your findings -- we'd love to tweak the tool even more with some external feedback :) - Aaron

On Wed, Jun 10, 2009 at 7:39 AM, stephen mulcahy wrote: > Hi, > > I'm currently doing some testing of different configurations using the > Hadoop Sort as follows, > > bin/hadoop jar hadoop-*-examples.jar randomwriter > -Dtest.randomwrite.total_bytes=107374182400 /benchmark100 > > bin/hadoop jar hadoop-*-examples.jar sort /benchmark100 rand-sort > > The only changes I've made from the standard config are the following in > conf/mapred-site.xml > > mapred.child.java.opts -Xmx1024M > mapred.tasktracker.map.tasks.maximum 8 > mapred.tasktracker.reduce.tasks.maximum 4 > > I'm running this on 4 systems, each with 8 processor cores and 4 separate > disks. > > Is there anything else I should change to stress memory more? The systems > in question have 16GB of memory but the most that's getting used during a > run of this benchmark is about 2GB (and most of that seems to be os > caching). > > Thanks, > > -stephen > > -- > Stephen Mulcahy, DI2, Digital Enterprise Research Institute, > NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland > http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: Help - ClassNotFoundException from tasktrackers
mapred.local.dir can definitely work with multiple paths, but it's a bit strict about the format. You should have one or more paths, separated by commas and no whitespace. e.g. "/disk1/mapred,/disk2/mapred". If you've got them on new lines, etc, then it might try to interpret that as part of one of the paths. Could this be your issue? If you paste your hadoop-site.xml in to an email we could take a look. - Aaron On Wed, Jun 10, 2009 at 2:41 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > Hi, > > After some amount of debugging, I narrowed down the problem. I'd set > multiple paths (multiple disks mounted at different endpoints) for the * > mapred.local.dir* setting. When I changed this to just one path, everything > started working - previously the job jar and the job xml weren't being > created under the local/taskTracker/jobCache folder. > > Any idea why? Am I doing something wrong? I've read somewhere else that > specifying multiple disks for mapred.local.dir is important to increase > disk > bandwidth so that the map outputs get written faster to the local disks on > the tasktracker nodes. > > Cheers, > Harish > > On Tue, Jun 9, 2009 at 5:21 PM, Harish Mallipeddi < > harish.mallipe...@gmail.com> wrote: > > > I setup a new cluster (1 namenode + 2 datanodes). I'm trying to run the > > GenericMRLoadGenerator program from hadoop-0.20.0-test.jar. But I keep > > getting the ClassNotFoundException. Any reason why this would be > happening? > > It seems to me like the tasktrackers cannot find the class files from the > > hadoop program but when you do "hadoop jar", it will automatically ship > the > > job jar file to all the tasktracker nodes and put them in the classpath? 
> > > > $ *bin/hadoop jar hadoop-0.20.0-test.jar > loadgen*-Dtest.randomtextwrite.bytes_per_map=1048576 > > -Dhadoop.sort.reduce.keep.percent=50.0 -outKey org.apache.hadoop.io.Text > > -outValue org.apache.hadoop.io.Text > > > > No input path; ignoring InputFormat > > Job started: Tue Jun 09 03:16:30 PDT 2009 > > 09/06/09 03:16:30 INFO mapred.JobClient: Running job: > job_200906090316_0001 > > 09/06/09 03:16:31 INFO mapred.JobClient: map 0% reduce 0% > > 09/06/09 03:16:40 INFO mapred.JobClient: Task Id : > > attempt_200906090316_0001_m_00_0, Status : FAILED > > java.io.IOException: Split class > > > org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit > > not found > > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:324) > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > > at org.apache.hadoop.mapred.Child.main(Child.java:170) > > *Caused by: java.lang.ClassNotFoundException: > > > org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit > > * > > at java.net.URLClassLoader$1.run(URLClassLoader.java:200) > > at java.security.AccessController.doPrivileged(Native Method) > > at java.net.URLClassLoader.findClass(URLClassLoader.java:188) > > at java.lang.ClassLoader.loadClass(ClassLoader.java:307) > > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) > > at java.lang.ClassLoader.loadClass(ClassLoader.java:252) > > at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) > > at java.lang.Class.forName0(Native Method) > > at java.lang.Class.forName(Class.java:247) > > at > > > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:761) > > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:321) > > ... 2 more > > [more of the same exception from different map tasks...I've removed them] > > > > Actually the same thing happens even if I run the PiEstimator program > from > > hadoop-0.20.0-examples.jar. 
> > > > Thanks, > > > > -- > > Harish Mallipeddi > > http://blog.poundbang.in > > > > > > -- > Harish Mallipeddi > http://blog.poundbang.in >
Re: Multiple NIC Cards
Does hadoop "cache" the server names anywhere? Because I changed to using DNS for name resolution, but when I go to the nodes view, it is trying to view with the old name. And I changed the hadoop-site.xml file so that it no longer has any of those values. Any help would be appreciated. Thank you, -John

On Jun 9, 2009, at 9:24 PM, John Martyniak wrote: So I setup a dns server that is for the internal network, changed all of the names to duey.local, and created a master zone for .local on the DNS. Put the domain server as the first one in the /etc/resolv.conf file, added it to the interface. I changed the hostname of the machine that it is running on from duey..com to duey.local. Checked that the dns resolves, and it does. Ran nslookup and it returns the name of the machine given the ip address. Changed all of the names from the IP Addresses to duey.local in my hadoop-site.xml, changed the names in the masters and slaves files. Deleted all of the logs, deleted the /tmp directory stuff. Then restarted hadoop. And much to my surprise, it still didn't work. I really thought that this would work, as it seems to be the consensus that the issue is the resolution of the name. Any other thoughts would be greatly appreciated. -John

On Jun 9, 2009, at 3:17 PM, Raghu Angadi wrote: I still need to go through the whole thread, but we feel your pain. First, please try setting fs.default.name to the namenode internal ip on the datanodes. This should make the NN attach the internal ip for the datanodes (assuming your routing is correct). The NameNode webUI should list internal ips for the datanodes. You might have to temporarily change NameNode code to listen on 0.0.0.0. That said, the issues you are facing are pretty unfortunate. As Steve mentioned, Hadoop is all confused about hostname/ip and there is unnecessary reliance on hostnames and reverse DNS lookups in many, many places. At least fairly straightforward setups with multiple NICs should be handled well.
dfs.datanode.dns.interface should work like you expected (but I'm not very surprised it didn't). Another thing you could try is setting dfs.datanode.address to the internal ip address (this might already be discussed in the thread). This should at least get all the bulk data transfers to happen over the internal NICs. One way to make sure is to hover over the datanode entry on the NameNode webUI; it shows the ip address. Good luck. It might be better to document your pains and findings in a Jira (with most of the details in one or more comments rather than in the description). Raghu.

John Martyniak wrote: So I changed all of the 0.0.0.0 on one machine to point to the 192.168.1.102 address. And still it picks up the hostname and ip address of the external network. I am kind of at my wits' end with this, as I am not seeing a solution yet, except to take the machines off of the external network and leave them on the internal network, which isn't an option. Has anybody had this problem before? What was the solution? -John

On Jun 9, 2009, at 10:17 AM, Steve Loughran wrote: One thing to consider is that some of the various services of Hadoop are bound to 0:0:0:0, which means every IPv4 address. You really want to bring up everything, including jetty services, on the en0 network adapter, by binding them to 192.168.1.102; this will cause anyone trying to talk to them over the other network to fail, which at least finds the problem sooner rather than later.

John Martyniak President/CEO Before Dawn Solutions, Inc. 9457 S. University Blvd #266 Highlands Ranch, CO 80126 o: 877-499-1562 c: 303-522-1756 e: j...@beforedawnsoutions.com w: http://www.beforedawnsolutions.com
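As a concrete sketch of Raghu's dfs.datanode.address suggestion (untested here; 192.168.1.102 is the internal address from this thread, and 50010 is the default datanode transfer port):

```xml
<!-- Hypothetical hadoop-site.xml fragment; verify against your Hadoop version -->
<property>
  <name>dfs.datanode.address</name>
  <value>192.168.1.102:50010</value>
</property>
```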
Re: speedy Google Maps driving directions like implementation
Hello, On 10.06.2009, at 11:43, Lukáš Vlček wrote: Hi, I am wondering how Google implemented the driving directions function in Maps. More specifically, how did they make it so fast? I asked a Google engineer about this and all he told me is that there are a bunch of MapReduce cycles involved in the process, but I don't think that is really close to the truth. How is it possible to implement such a real-time function in plain MapReduce fashion (from the Hadoop point of view)? The only possibility I can think of right now is that they either execute the MapReduce computation only in memory (intermediate results, as well as Reduce-to-next-Map results, are kept only in some kind of distributed memory) or they use another architecture for this. In simple words: I know there are some tutorials about how to nail down the SSSP problem with MapReduce, but I cannot believe it could produce results with the quick response times I experience with Google Maps. Any comments, ideas?

I don't think they use MapReduce for the user-facing part of Google Maps. If I had to implement something similar, I would use MapReduce jobs to perform as much "offline" processing as possible. Therefore my guess is that MapReduce jobs are used to create the static map tiles from the huge satellite imagery. I also think that MapReduce jobs create an index or a graph representation of roads and their intersections with other roads, including intersections with towns and other points of interest. The resulting graph is probably stored in a Memcache-type memory system for very fast retrieval. But then again, I'm not a GIS expert (not even close). ;-) J --
Re: maybe a bug in hadoop?
Hi, This seems to already be addressed by https://issues.apache.org/jira/browse/HADOOP-2366 -stephen

Brian Bockelman wrote: Hey Stephen, I've hit this "bug" before (rather, our admins did...). I would be happy to see you file it - after checking for duplicates - so I no longer have to warn people about it. Brian

On Jun 10, 2009, at 6:29 AM, stephen mulcahy wrote: Tim Wintle wrote: I thought I'd double check... $ touch \ file\ with\ leading\ space $ touch file\ without\ leading\ space $ ls file with leading space file without leading space Oh sure, Linux will certainly allow you to create files/dirs with such names (and better .. hows about touch \.\.\. or similar, which used to be a favourite of guys who had managed to exploit bad permissions on ftp sites to upload material of dubious copyright). But is that the intention in the Hadoop config or is it a usability oversight? Hence my mail here rather than raising a bug immediately. ... so stripping it out could mean you couldn't enter some valid directory names (but who has folders starting with a space?) Well ... I do now ;) -stephen

-- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Hadoop benchmarking
Hi, I'm currently doing some testing of different configurations using the Hadoop Sort as follows,

bin/hadoop jar hadoop-*-examples.jar randomwriter -Dtest.randomwrite.total_bytes=107374182400 /benchmark100
bin/hadoop jar hadoop-*-examples.jar sort /benchmark100 rand-sort

The only changes I've made from the standard config are the following in conf/mapred-site.xml:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>

I'm running this on 4 systems, each with 8 processor cores and 4 separate disks. Is there anything else I should change to stress memory more? The systems in question have 16GB of memory but the most that's getting used during a run of this benchmark is about 2GB (and most of that seems to be os caching). Thanks, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: HDFS issues..!
On Wed, Jun 10, 2009 at 4:55 AM, Sugandha Naolekar wrote: > If I want to make the data transfer fast, then what am I supposed > to do? I want to place the data in HDFS and replicate it in a fraction of > a second.

I want to go to France, but it takes 10+ hours to get there from California on the fastest plane. How can I get there faster?

> Can that be possible, and how? Placing a 5GB file will take at least > half an hour or so... but if it's a large cluster, let's say of 7 nodes, > then placing it in HDFS would take around 2-3 hours. So, how can that time > delay be avoided..?

HDFS will only replicate as many times as you want it to. The write is also pipelined. This means that writing a 5G file that is replicated to 3 nodes is only marginally faster than the same file on 10 nodes, if for some reason you wanted to set your replication count to 10 (unnecessary for 99.9% of use cases).

> Also, my aim is simply to transfer the data, i.e. dumping the data > into HDFS and getting it back whenever needed. So, for this transfer, what > speed can be achieved?

HDFS isn't magic. You can only write as fast as your disk and network can. If your disk has 50MB/sec of throughput, you'll probably be limited at 50MB/sec. Expecting much more than this in real life scenarios is unrealistic. -Todd
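Todd's point can be made concrete with back-of-the-envelope arithmetic (the 50 MB/s figure is illustrative, not a measurement from any cluster in this thread):

```python
# Back-of-the-envelope: HDFS write time is bounded by disk/network
# throughput, and the pipelined write adds little per extra replica.
# Numbers are illustrative, not measurements.

def write_time_seconds(size_mb, throughput_mb_s):
    return size_mb / throughput_mb_s

five_gb = 5 * 1024  # MB
t = write_time_seconds(five_gb, 50)  # a 50 MB/s disk-limited pipeline
print(f"{t:.0f}s (~{t / 60:.1f} min)")  # ~102s, i.e. minutes, not hours
```

So on healthy hardware a 5GB write should finish in minutes; if it takes hours, the bottleneck is elsewhere (configuration, network, or a sick node), not an inherent HDFS limit.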
Re: HDFS data transfer!
Hey Sugandha, Transfer rates depend on the quality/quantity of your hardware and the quality of the client disk that is generating the data. I usually say that you should expect near-hardware-bottleneck speeds for an otherwise idle cluster. There should be no "make it fast" required (though you should review the logs for errors if it's going slow). I would expect a 5GB file to take around 3-5 minutes to write on our cluster, but it's a well-tuned and operational cluster. As Todd (I think) mentioned before, we can't help any when you say "I want to make it faster". You need to provide diagnostic information - logs, Ganglia plots, stack traces, something - that folks can look at. Brian

On Jun 10, 2009, at 2:25 AM, Sugandha Naolekar wrote: But if I want to make it fast, then??? I want to place the data in HDFS and replicate it in a fraction of a second. Can that be possible, and how? On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena wrote: I would suppose about 2-3 hours. It took me some 2 days to load a 160 Gb file. Secura On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar wrote: Hello! If I try to transfer a 5GB VDI file from a remote host (not a part of the hadoop cluster) into HDFS, and get it back, how much time is it supposed to take? No map-reduce involved. Simply writing files in and out of HDFS through a simple piece of java code (usage of APIs). -- Regards! Sugandha -- Regards! Sugandha
Re: maybe a bug in hadoop?
Hey Stephen, I've hit this "bug" before (rather, our admins did...). I would be happy to see you file it - after checking for duplicates - so I no longer have to warn people about it. Brian

On Jun 10, 2009, at 6:29 AM, stephen mulcahy wrote: Tim Wintle wrote: I thought I'd double check... $ touch \ file\ with\ leading\ space $ touch file\ without\ leading\ space $ ls file with leading space file without leading space Oh sure, Linux will certainly allow you to create files/dirs with such names (and better .. hows about touch \.\.\. or similar, which used to be a favourite of guys who had managed to exploit bad permissions on ftp sites to upload material of dubious copyright). But is that the intention in the Hadoop config or is it a usability oversight? Hence my mail here rather than raising a bug immediately. ... so stripping it out could mean you couldn't enter some valid directory names (but who has folders starting with a space?) Well ... I do now ;) -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: maybe a bug in hadoop?
Tim Wintle wrote: I thought I'd double check... $ touch \ file\ with\ leading\ space $ touch file\ without\ leading\ space $ ls file with leading space file without leading space

Oh sure, Linux will certainly allow you to create files/dirs with such names (and better .. hows about touch \.\.\. or similar, which used to be a favourite of guys who had managed to exploit bad permissions on ftp sites to upload material of dubious copyright). But is that the intention in the Hadoop config or is it a usability oversight? Hence my mail here rather than raising a bug immediately.

... so stripping it out could mean you couldn't enter some valid directory names (but who has folders starting with a space?)

Well ... I do now ;) -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: maybe a bug in hadoop?
On Wed, 2009-06-10 at 13:56 +0100, stephen mulcahy wrote: > Hi, > > I've been running some tests with hadoop in pseudo-distributed mode. My > config includes the following in conf/hdfs-site.xml > > > dfs.data.dir > /hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4 > > > When I started running hadoop I was disappointed to see that only > /hdfs/disk1 was created. It was only later I noticed that HADOOP_HOME > now contained the following new directories > > HADOOP_HOME/ /hdfs/disk2 > HADOOP_HOME/ /hdfs/disk3 > HADOOP_HOME/ /hdfs/disk4 > > I guess any leading spaces should be stripped out of the data dir names? > Or maybe there is a reason for this behaviour. I thought I should > mention it just in case. I thought I'd double check... $touch \ file\ with\ leading\ space $touch file\ without\ leading\ space $ ls file with leading space file without leading space ... so stripping it out could mean you couldn't enter some valid directory names (but who has folders starting with a space?) I'm sure my input wasn't very useful, but just a comment. Tim Wintle
maybe a bug in hadoop?
Hi, I've been running some tests with hadoop in pseudo-distributed mode. My config includes the following in conf/hdfs-site.xml:

<property>
  <name>dfs.data.dir</name>
  <value>/hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4</value>
</property>

When I started running hadoop I was disappointed to see that only /hdfs/disk1 was created. It was only later I noticed that HADOOP_HOME now contained the following new directories:

HADOOP_HOME/ /hdfs/disk2
HADOOP_HOME/ /hdfs/disk3
HADOOP_HOME/ /hdfs/disk4

I guess any leading spaces should be stripped out of the data dir names? Or maybe there is a reason for this behaviour. I thought I should mention it just in case. Thanks, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
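The behaviour stephen describes is consistent with the comma-separated value being split without trimming whitespace, so " /hdfs/disk2" ends up treated as a relative path under the working directory. A small sketch of the failure mode (an illustration only, not Hadoop's actual parsing code):

```python
# Splitting a comma-separated dfs.data.dir value without stripping
# whitespace yields paths with a leading space; those are not absolute,
# so they get created relative to the current directory (HADOOP_HOME).

raw = "/hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4"

naive = raw.split(",")
# The paths that would silently become relative:
print([p for p in naive if not p.startswith("/")])
# -> [' /hdfs/disk2', ' /hdfs/disk3', ' /hdfs/disk4']

fixed = [p.strip() for p in raw.split(",")]
assert all(p.startswith("/") for p in fixed)
```

The practical workaround, pending a fix like HADOOP-2366, is simply to write the value with no whitespace after the commas.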
Indexing on top of Hadoop
Hi, I have a huge LDIF file in order of GBs spanning some million user records. I am running the example "Grep" job on that file. The search results have not really been upto expectations because of it being a basic per line , brute force. I was thinking of building some indexes inside HDFS for that file , so that the search results could improve. What could I possibly try to achieve this? Secura
Re: Large size Text file split
There is always NLineInputFormat. You specify the number of lines per split. The key is the position of the line start in the file, the value is the line itself. The parameter mapred.line.input.format.linespermap controls the number of lines per split.

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo > wrote: > > > Hi, all > > > > I have a large csv file ( larger than 10 GB ), I'd like to use a certain > > InputFormat to split it into smaller parts so each Mapper can deal with > > a piece of the csv file. However, as far as I know, FileInputFormat only > > cares about the byte size of the file, that is, the class can divide the csv > > file into many parts, and maybe some part is not a well-formed CSV file. > > For example, one line of the CSV file is not terminated with CRLF, or > > maybe some text is trimmed. > > > > How to ensure each FileSplit is a smaller valid CSV file using a proper > > InputFormat? > > > > BR/anderson > > > > If all you care about is the splits occurring at line boundaries, then > TextInputFormat will work. > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html > > If not, I guess you can write your own InputFormat class. > > -- > Harish Mallipeddi > http://blog.poundbang.in > -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Large size Text file split
On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wrote:
> Hi, all
>
> I have a large csv file (larger than 10 GB), and I'd like to use a certain
> InputFormat to split it into smaller parts so each Mapper can deal with a
> piece of the csv file. However, as far as I know, FileInputFormat only
> cares about the byte size of the file; that is, the class can divide the csv
> file into many parts, and maybe some part is not a well-formed CSV file.
> For example, one line of the CSV file is not terminated with CRLF, or
> maybe some text is trimmed.
>
> How to ensure each FileSplit is a smaller valid CSV file using a proper
> InputFormat?
>
> BR/anderson

If all you care about is the splits occurring at line boundaries, then TextInputFormat will work.

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

If not, I guess you can write your own InputFormat class.

--
Harish Mallipeddi
http://blog.poundbang.in
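The reason byte-oriented splits still produce whole lines is the record reader's convention: every reader except the first skips the partial line at the start of its split, and every reader runs past its split end to finish the last line it started. A toy sketch of that rule (a simplification of what the line record reader does, not the actual code):

```python
def read_split(data: bytes, start: int, length: int):
    """Return the complete lines owned by the byte range
    [start, start+length): skip a partial first line (the previous
    split finishes it) and read past the end to complete the last."""
    pos = start
    if start != 0:
        # Back up one byte and scan to the next newline; the previous
        # reader owns any line straddling our start.
        nl = data.find(b"\n", start - 1)
        pos = nl + 1 if nl != -1 else len(data)
    lines = []
    while pos < start + length and pos < len(data):
        nl = data.find(b"\n", pos)
        end = len(data) if nl == -1 else nl
        lines.append(data[pos:end])
        pos = end + 1
    return lines

data = b"row1\nrow2\nrow3\n"
# A split boundary in the middle of "row2" still yields whole lines:
print(read_split(data, 0, 7))
print(read_split(data, 7, 8))
```

Note this only guarantees line boundaries: a CSV record with a quoted embedded newline can still be cut in half, which would need a custom InputFormat.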
Large size Text file split
Hi, all

I have a large csv file (larger than 10 GB), and I'd like to use a certain InputFormat to split it into smaller parts so each Mapper can deal with a piece of the csv file. However, as far as I know, FileInputFormat only cares about the byte size of the file; that is, the class can divide the csv file into many parts, and maybe some part is not a well-formed CSV file. For example, one line of the CSV file is not terminated with CRLF, or maybe some text is trimmed.

How to ensure each FileSplit is a smaller valid CSV file using a proper InputFormat?

BR/anderson
HDFS issues..!
If I want to make the data transfer fast, then what am I supposed to do? I want to place the data in HDFS and replicate it in a fraction of seconds. Can that be possible, and how? Placing a 5GB file will take at least half an hour or so; and if it's a large cluster, let's say of 7 nodes, then placing it in HDFS would take around 2-3 hours. So, how can that time delay be avoided? Also, my aim is simply to transfer the data, i.e., dumping the data into HDFS and getting it back whenever needed. So, for this transfer, what speed can be achieved?

--
Regards!
Sugandha
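Some back-of-envelope arithmetic helps set expectations here. With the default replication factor of 3, the client still uploads the data roughly once (datanodes pipeline the replicas among themselves), so the write is bounded by the slowest of the client's network link and the datanodes' disks. The throughput figures below are assumptions to plug your own measurements into:

```python
def expected_transfer_seconds(size_bytes, throughput_mb_per_s):
    """Rough HDFS write time: size divided by effective throughput.
    Replication is pipelined, so the client sends the data about once
    regardless of the replication factor."""
    return size_bytes / (throughput_mb_per_s * 1024 * 1024)

five_gb = 5 * 1024 ** 3
for mbps in (3, 12, 50):  # assumed effective throughputs, MB/s
    minutes = expected_transfer_seconds(five_gb, mbps) / 60
    print(f"{mbps} MB/s -> {minutes:.1f} minutes")
```

If 5 GB really takes half an hour, the effective throughput is only about 3 MB/s; on a healthy gigabit network it should take minutes, so the first things to check are network and disk speed rather than HDFS itself.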
speedy Google Maps driving directions like implementation
Hi, I am wondering how Google implemented the driving directions function in Maps; more specifically, how they made it so fast. I asked one Google engineer about this, and all he told me is that there are a bunch of MapReduce cycles involved in the process, but I don't think that is really close to the truth. How is it possible to implement such a real-time function in plain MapReduce fashion (from the Hadoop point of view)? The only possibility I can think of right now is that they either execute the MapReduce computation only in memory (intermediate results, as well as reduce-to-next-map results, are kept only in some kind of distributed memory) or they use another architecture for this. In simple words: I know there are some tutorials on how to nail down the SSSP problem with MapReduce, but I cannot believe it can produce results with the quick response I experience with Google Maps. Any comments or ideas? Thanks, Lukas
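Whatever Google actually runs (which the post rightly doubts is plain MapReduce), the usual pattern for real-time routing separates two phases: heavy offline precomputation (possibly MapReduce) builds a compact graph or index, and each query then runs a cheap in-memory search over it. Even unadorned Dijkstra answers a query on a preloaded graph in milliseconds; the road graph below is a made-up toy:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths over an in-memory adjacency list
    {node: [(neighbor, weight), ...]} -- the kind of per-query search
    a routing service can afford once the data is already resident."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

roads = {
    "A": [("B", 4), ("C", 2)],
    "C": [("B", 1), ("D", 7)],
    "B": [("D", 1)],
}
print(dijkstra(roads, "A"))
```

Production systems layer speedup techniques (hierarchies, precomputed shortcuts) on top of this, but the division of labor is the same: batch systems build the index, and the query path never touches MapReduce.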
Re: Help - ClassNotFoundException from tasktrackers
Hi,

After some amount of debugging, I narrowed down the problem. I'd set multiple paths (multiple disks mounted at different endpoints) for the *mapred.local.dir* setting. When I changed this to just one path, everything started working; previously the job jar and the job xml weren't being created under the local/taskTracker/jobCache folder. Any idea why? Am I doing something wrong? I've read somewhere else that specifying multiple disks for mapred.local.dir is important to increase disk bandwidth, so that the map outputs get written faster to the local disks on the tasktracker nodes.

Cheers,
Harish

On Tue, Jun 9, 2009 at 5:21 PM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote:
> I setup a new cluster (1 namenode + 2 datanodes). I'm trying to run the
> GenericMRLoadGenerator program from hadoop-0.20.0-test.jar. But I keep
> getting the ClassNotFoundException. Any reason why this would be happening?
> It seems to me like the tasktrackers cannot find the class files from the
> hadoop program, but when you do "hadoop jar", shouldn't it automatically ship
> the job jar file to all the tasktracker nodes and put it on the classpath?
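For reference, a multiple-directory setup itself is legitimate and is the recommended way to spread I/O across disks. A sketch of how the property might look in the site configuration (the paths here are hypothetical); note that the list must not contain embedded whitespace, which is one plausible cause of only some paths misbehaving:

```xml
<property>
  <name>mapred.local.dir</name>
  <!-- Comma-separated, no embedded spaces; every path must exist and
       be writable by the tasktracker user on that node. -->
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```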
> $ bin/hadoop jar hadoop-0.20.0-test.jar loadgen -Dtest.randomtextwrite.bytes_per_map=1048576 -Dhadoop.sort.reduce.keep.percent=50.0 -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text
>
> No input path; ignoring InputFormat
> Job started: Tue Jun 09 03:16:30 PDT 2009
> 09/06/09 03:16:30 INFO mapred.JobClient: Running job: job_200906090316_0001
> 09/06/09 03:16:31 INFO mapred.JobClient: map 0% reduce 0%
> 09/06/09 03:16:40 INFO mapred.JobClient: Task Id : attempt_200906090316_0001_m_00_0, Status : FAILED
> java.io.IOException: Split class org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit not found
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:324)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>     at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>     at java.lang.Class.forName0(Native Method)
>     at java.lang.Class.forName(Class.java:247)
>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:761)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:321)
>     ... 2 more
> [more of the same exception from different map tasks...I've removed them]
>
> Actually the same thing happens even if I run the PiEstimator program from hadoop-0.20.0-examples.jar.
>
> Thanks,
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in

--
Harish Mallipeddi
http://blog.poundbang.in
Re: HDFS data transfer!
But if I want to make it fast, then what? I want to place the data in HDFS and replicate it in a fraction of seconds. Can that be possible, and how?

On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena wrote:
> I would suppose about 2-3 hours. It took me some 2 days to load a 160 Gb
> file.
> Secura
>
> On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar wrote:
> >
> > Hello!
> >
> > If I try to transfer a 5GB VDI file from a remote host (not a part of the hadoop
> > cluster) into HDFS, and get it back, how much time is it supposed to take?
> >
> > No map-reduce involved. Simply writing files in and out of HDFS through a
> > simple piece of java code (usage of APIs).
> >
> > --
> > Regards!
> > Sugandha

--
Regards!
Sugandha
Re: Can not build Hadoop in CentOS 64bit
It works, thanks On Wed, Jun 10, 2009 at 2:28 PM, Todd Lipcon wrote: > Hi Ian, > > Recent versions of Hadoop require Java 1.6. You will not be able to > successfully compile on Java 1.5 >
Re: HDFS data transfer!
I would suppose about 2-3 hours. It took me some 2 days to load a 160 Gb file.

Secura

On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar wrote:
> Hello!
>
> If I try to transfer a 5GB VDI file from a remote host (not a part of the hadoop
> cluster) into HDFS, and get it back, how much time is it supposed to take?
>
> No map-reduce involved. Simply writing files in and out of HDFS through a
> simple piece of java code (usage of APIs).
>
> --
> Regards!
> Sugandha