Re: Hadoop streaming - No room for reduce task error

2009-06-10 Thread jason hadoop
The data headed into the reduce is sorted first, and unless the machine/JVM has a huge
allowed memory space it will spill to disk during that sort; if it is expected to be
larger than the partition's free space, you get this error.
If I did my math correctly, you are trying to push ~2TB through the single
reduce.

As for the part- files: if you have the number of reduces set to zero,
you will get N part files, where N is the number of map tasks.

If you absolutely must have it all go to one reduce, you will need to
increase the free disk space. I think 0.19.1 preserves compression for the map
output, so you could try enabling compression for map output.

If you have many nodes, you can set the number of reduces to some number and
then use sort -m on the part files to merge-sort them, assuming your reduce
preserves ordering.
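Roughly like this, assuming the job output landed in test9 and there is room on a
local disk (the /local/work paths are just placeholders):

bin/hadoop fs -get test9 /local/work/test9                   # pull the part files off HDFS
sort -m /local/work/test9/part-* > /local/work/merged.out    # merge-sort the already-sorted parts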

Try adding these parameters to your job line:
-D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK
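For example, folded into the command line from your mail (a sketch - I kept your paths
exactly as you posted them, and I believe the generic -D options have to come before
the streaming-specific ones):

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
 -D mapred.compress.map.output=true \
 -D mapred.output.compression.type=BLOCK \
 -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
 -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
 -input /logs/*.log -output test9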

BTW, /bin/cat works fine as an identity mapper or an identity reducer


On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon  wrote:

> Hey Scott,
> It turns out that Alex's answer was mistaken - your error is actually
> coming
> from lack of disk space on the TT that has been assigned the reduce task.
> Specifically, there is not enough space in mapred.local.dir. You'll need to
> change your mapred.local.dir to point to a partition that has enough space
> to contain your reduce output.
>
> As for why this is the case, I hope someone will pipe up. It seems to me
> that reduce output can go directly to the target filesystem without using
> space on mapred.local.dir.
>
> Thanks
> -Todd
>
> On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard 
> wrote:
>
> > What is mapred.child.ulimit set to?  This configuration options specifics
> > how much memory child processes are allowed to have.  You may want to up
> > this limit and see what happens.
> >
> > Let me know if that doesn't get you anywhere.
> >
> > Alex
> >
> > On Wed, Jun 10, 2009 at 9:40 AM, Scott  wrote:
> >
> > > Complete newby map/reduce question here.  I am using hadoop streaming
> as
> > I
> > > come from a Perl background, and am trying to prototype/test a process
> to
> > > load/clean-up ad server log lines from multiple input files into one
> > large
> > > file on the hdfs that can then be used as the source of a hive db
> table.
> > > I have a perl map script that reads an input line from stdin, does the
> > > needed cleanup/manipulation, and writes back to stdout.I don't
> really
> > > need a reduce step, as I don't care what order the lines are written
> in,
> > and
> > > there is no summary data to produce.  When I run the job with -reducer
> > NONE
> > > I get valid output, however I get multiple part-x files rather than
> > one
> > > big file.
> > > So I wrote a trivial 'reduce' script that reads from stdin and simply
> > > splits the key/value, and writes the value back to stdout.
> > >
> > > I am executing the code as follows:
> > >
> > > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
> > > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer
> > > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input
> > /logs/*.log
> > > -output test9
> > >
> > > The code I have works when given a small set of input files.  However,
> I
> > > get the following error when attempting to run the code on a large set
> of
> > > input files:
> > >
> > > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09
> > 15:43:00,905
> > > WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task.
> > Node
> > > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has
> 2004049920
> > > bytes free; but we expect reduce input to take 22138478392
> > >
> > > I assume this is because the all the map output is being buffered in
> > memory
> > > prior to running the reduce step?  If so, what can I change to stop the
> > > buffering?  I just need the map output to go directly to one large
> file.
> > >
> > > Thanks,
> > > Scott
> > >
> > >
> >
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


RE: Large size Text file split

2009-06-10 Thread Wenrui Guo
I don't understand the internal logic of the FileSplit and Mapper.

By my understanding, FileInputFormat is the actual class that
takes care of the file splitting. So it's reasonable that one large file is
split into 5 smaller parts, each part less than 2GB (since we
specify the number of splits as 5).

However, the FileSplit has rough edges, so mapper 1, which takes split 1
as input, omits the incomplete part at the end of split 1, and mapper 2
then continues to read that incomplete part and adds the remaining part
of split 2?

Take this as an example:

The original file is:

1::122::5::838985046 (CRLF)
1::185::5::838983525 (CRLF)
1::231::5::838983392 (CRLF)

Assume the number of splits is 2; then the above content is divided into two
parts:

Split 1:
1::122::5::838985046 (CRLF)
1::185::5::8
 

Split 2:
38983525 (CRLF)
1::231::5::838983392 (CRLF)

Afterwards, Mapper 1 takes split 1 as input, but after eating the line
1::122::5::838985046 it finds that the remaining part is not a complete
record, so Mapper 1 bypasses it, while Mapper 2 reads it and adds it
ahead of the first line of Split 2 to compose a valid record.

Is that correct? If it is, which class implements the above logic?

BR/anderson

-Original Message-
From: Aaron Kimball [mailto:aa...@cloudera.com] 
Sent: Thursday, June 11, 2009 11:49 AM
To: core-user@hadoop.apache.org
Subject: Re: Large size Text file split

The FileSplit boundaries are "rough" edges -- the mapper responsible for
the previous split will continue until it finds a full record, and the
next mapper will read ahead and only start on the first record boundary
after the byte offset.
- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo 
wrote:

> I think the default TextInputFormat can meet my requirement. However, 
> even if the JavaDoc of TextInputFormat says the TextInputFormat could 
> divide input file as text lines which ends with CRLF. But I'd like to 
> know if the FileSplit size is not N times of line length, what will be

> happen eventually?
>
> BR/anderson
>
> -Original Message-
> From: jason hadoop [mailto:jason.had...@gmail.com]
> Sent: Wednesday, June 10, 2009 8:39 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Large size Text file split
>
> There is always NLineInputFormat. You specify the number of lines per 
> split.
> The key is the position of the line start in the file, value is the 
> line itself.
> The parameter mapred.line.input.format.linespermap controls the number

> of lines per split
>
> On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi < 
> harish.mallipe...@gmail.com> wrote:
>
> > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo 
> > 
> > wrote:
> >
> > > Hi, all
> > >
> > > I have a large csv file ( larger than 10 GB ), I'd like to use a 
> > > certain InputFormat to split it into smaller part thus each Mapper

> > > can deal with piece of the csv file. However, as far as I know, 
> > > FileInputFormat only cares about byte size of file, that is, the 
> > > class can divide the csv file as many part, and maybe some part is
> not a well-format CVS file.
> > > For example, one line of the CSV file is not terminated with CRLF,

> > > or maybe some text is trimed.
> > >
> > > How to ensure each FileSplit is a smaller valid CSV file using a 
> > > proper InputFormat?
> > >
> > > BR/anderson
> > >
> >
> > If all you care about is the splits occurring at line boundaries, 
> > then
>
> > TextInputFormat will work.
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/map
> > re
> > d/TextInputFormat.html
> >
> > If not I guess you can write your own InputFormat class.
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
> >
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>


Re: Large size Text file split

2009-06-10 Thread Aaron Kimball
The FileSplit boundaries are "rough" edges -- the mapper responsible for the
previous split will continue until it finds a full record, and the next
mapper will read ahead and only start on the first record boundary after the
byte offset.
- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo  wrote:

> I think the default TextInputFormat can meet my requirement. However,
> even if the JavaDoc of TextInputFormat says the TextInputFormat could
> divide input file as text lines which ends with CRLF. But I'd like to
> know if the FileSplit size is not N times of line length, what will be
> happen eventually?
>
> BR/anderson
>
> -Original Message-
> From: jason hadoop [mailto:jason.had...@gmail.com]
> Sent: Wednesday, June 10, 2009 8:39 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Large size Text file split
>
> There is always NLineInputFormat. You specify the number of lines per
> split.
> The key is the position of the line start in the file, value is the line
> itself.
> The parameter mapred.line.input.format.linespermap controls the number
> of lines per split
>
> On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi <
> harish.mallipe...@gmail.com> wrote:
>
> > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo 
> > wrote:
> >
> > > Hi, all
> > >
> > > I have a large csv file ( larger than 10 GB ), I'd like to use a
> > > certain InputFormat to split it into smaller part thus each Mapper
> > > can deal with piece of the csv file. However, as far as I know,
> > > FileInputFormat only cares about byte size of file, that is, the
> > > class can divide the csv file as many part, and maybe some part is
> not a well-format CVS file.
> > > For example, one line of the CSV file is not terminated with CRLF,
> > > or maybe some text is trimed.
> > >
> > > How to ensure each FileSplit is a smaller valid CSV file using a
> > > proper InputFormat?
> > >
> > > BR/anderson
> > >
> >
> > If all you care about is the splits occurring at line boundaries, then
>
> > TextInputFormat will work.
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapre
> > d/TextInputFormat.html
> >
> > If not I guess you can write your own InputFormat class.
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
> >
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>


Re: Multiple NIC Cards

2009-06-10 Thread John Martyniak
So it turns out the reason I was getting duey.local. was that this is what
was in the reverse DNS on the nameserver from a previous test.  That is now
fixed, and the machine reports duey.local.xxx.com.


The only remaining issue is the trailing "." (Period) that is required  
by DNS to make the name fully qualified.
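For what it's worth, this is how I have been checking what reverse DNS actually hands
back (the IP is just the internal address discussed earlier in this thread - substitute
your own):

host 192.168.1.102
# should print something like:
# 102.1.168.192.in-addr.arpa domain name pointer duey.local.xxx.com.
# the trailing dot is how DNS writes a fully qualified name; resolver libraries normally strip it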


So I am not sure if this is a bug in how Hadoop uses this information or
some other issue.


If anybody has run across this issue before any help would be greatly  
appreciated.


Thank you,

-John

On Jun 10, 2009, at 9:21 PM, Matt Massie wrote:

If you look at the documentation for the getCanonicalHostName()  
function (thanks, Steve)...


http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html#getCanonicalHostName()

you'll see two Java security properties (networkaddress.cache.ttl  
and networkaddress.cache.negative.ttl).


You might take a look at your /etc/nsswitch.conf configuration as  
well to learn how hosts are resolved on your machine, e.g...


$ grep hosts /etc/nsswitch.conf
hosts:  files dns

and lastly, you may want to check if you are running nscd (the  
NameService cache daemon).  If you are, take a look at /etc/ 
nscd.conf for the caching policy it's using.


Good luck.

-Matt



On Jun 10, 2009, at 1:09 PM, John Martyniak wrote:

That is what I thought also, is that it needs to keep that  
information somewhere, because it needs to be able to communicate  
with all of the servers.


So I deleted the /tmp/had* and /tmp/hs* directories, removed the  
log files, and grepped for the duey name in all files in config.   
And the problem still exists.  Originally I thought that it might  
have had something to do with multiple entries in the .ssh/ 
authorized_keys file but removed everything there.  And the problem  
still existed.


So I think that I am going to grab a new install of hadoop 0.19.1,  
delete the existing one and start out fresh to see if that changes  
anything.


Wish me luck:)

-John

On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote:


John Martyniak wrote:
Does hadoop "cache" the server names anywhere?  Because I changed  
to using DNS for name resolution, but when I go to the nodes  
view, it is trying to view with the old name.  And I changed the  
hadoop-site.xml file so that it no longer has any of those values.


in SVN head, we try and get Java to tell us what is going on
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get  
the value, which is cached for life of the process. I don't know  
of anything else, but wouldn't be surprised -the Namenode has to  
remember the machines where stuff was stored.





John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com





John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com



Re: Multiple NIC Cards

2009-06-10 Thread John Martyniak

Matt,

Thanks for the suggestion.

I had actually forgotten about local dns caching.  I am using a mac so  
I used

dscacheutil -flushcache

to clear the cache, and I also investigated the ordering.  And
everything seems to be in order.


Except I still get a bogus result.

it is using the old name except with a trailing period.  So it is  
using duey.local. when it should be using duey.local.x.com (which  
is the internal name).




-John

On Jun 10, 2009, at 9:21 PM, Matt Massie wrote:

If you look at the documentation for the getCanonicalHostName()  
function (thanks, Steve)...


http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html#getCanonicalHostName()

you'll see two Java security properties (networkaddress.cache.ttl  
and networkaddress.cache.negative.ttl).


You might take a look at your /etc/nsswitch.conf configuration as  
well to learn how hosts are resolved on your machine, e.g...


$ grep hosts /etc/nsswitch.conf
hosts:  files dns

and lastly, you may want to check if you are running nscd (the  
NameService cache daemon).  If you are, take a look at /etc/ 
nscd.conf for the caching policy it's using.


Good luck.

-Matt



On Jun 10, 2009, at 1:09 PM, John Martyniak wrote:

That is what I thought also, is that it needs to keep that  
information somewhere, because it needs to be able to communicate  
with all of the servers.


So I deleted the /tmp/had* and /tmp/hs* directories, removed the  
log files, and grepped for the duey name in all files in config.   
And the problem still exists.  Originally I thought that it might  
have had something to do with multiple entries in the .ssh/ 
authorized_keys file but removed everything there.  And the problem  
still existed.


So I think that I am going to grab a new install of hadoop 0.19.1,  
delete the existing one and start out fresh to see if that changes  
anything.


Wish me luck:)

-John

On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote:


John Martyniak wrote:
Does hadoop "cache" the server names anywhere?  Because I changed  
to using DNS for name resolution, but when I go to the nodes  
view, it is trying to view with the old name.  And I changed the  
hadoop-site.xml file so that it no longer has any of those values.


in SVN head, we try and get Java to tell us what is going on
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get  
the value, which is cached for life of the process. I don't know  
of anything else, but wouldn't be surprised -the Namenode has to  
remember the machines where stuff was stored.





John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com





John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com



Re: code overview document

2009-06-10 Thread Zak Stone
Cloudera's long-form training videos about _using_ Hadoop are
excellent -- here's the intro video, and you can jump to other topics
using the links in the sidebar:

http://www.cloudera.com/hadoop-training-thinking-at-scale

Not sure what starting points to recommend for modifying Hadoop itself, though.

Zak


On Wed, Jun 10, 2009 at 11:06 PM, bharath
vissapragada wrote:
> As far as my knowledge goes , there are no good tutorials for coding on
> hadoop.. Yahoo provides a tutorial which is fa better than others
> http://developer.yahoo.com/hadoop/tutorial/ . If anyone knows any good
> mapreduce programming tutorials ..please let us know
> Thanks!
>
> On Thu, Jun 11, 2009 at 8:30 AM, Anurag Gujral wrote:
>
>> Hi All,
>>          Is there a detailed documentation on hadoop code.?
>> Thanks
>> Anurag
>>
>


2009 Hadoop Summit West - was wonderful

2009-06-10 Thread jason hadoop
I had a great time schmoozing with people, and enjoyed a couple of the talks.

I would love to see more from Priya Narasimhan; I hope their toolset for
automated fault detection in Hadoop clusters becomes generally available.
Zookeeper rocks on!

HBase is starting to look really good: in 0.20 the master node as a single
point of failure and configuration headache goes away, and ZooKeeper
takes over.

Owen O'Malley gave a solid presentation on the new Hadoop APIs and the
reasons for the changes.

It was good to hang with everyone, see you all next year!

I even got to spend a little time chatting with Tom White, and got a signed
copy of his book. Thanks, Tom!


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: code overview document

2009-06-10 Thread bharath vissapragada
As far as my knowledge goes, there are no good tutorials for coding on
Hadoop. Yahoo provides a tutorial which is far better than the others:
http://developer.yahoo.com/hadoop/tutorial/ . If anyone knows any good
MapReduce programming tutorials, please let us know.
Thanks!

On Thu, Jun 11, 2009 at 8:30 AM, Anurag Gujral wrote:

> Hi All,
>  Is there a detailed documentation on hadoop code.?
> Thanks
> Anurag
>


code overview document

2009-06-10 Thread Anurag Gujral
Hi All,
  Is there detailed documentation on the Hadoop code?
Thanks
Anurag


RE: Large size Text file split

2009-06-10 Thread Wenrui Guo
I think the default TextInputFormat can meet my requirement. However, even
though the JavaDoc of TextInputFormat says it divides the input file into
text lines ending with CRLF, I'd like to know what will eventually happen
if the FileSplit size is not an exact multiple of the line length.

BR/anderson 

-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com] 
Sent: Wednesday, June 10, 2009 8:39 PM
To: core-user@hadoop.apache.org
Subject: Re: Large size Text file split

There is always NLineInputFormat. You specify the number of lines per
split.
The key is the position of the line start in the file, value is the line
itself.
The parameter mapred.line.input.format.linespermap controls the number
of lines per split

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:

> On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo 
> wrote:
>
> > Hi, all
> >
> > I have a large csv file ( larger than 10 GB ), I'd like to use a 
> > certain InputFormat to split it into smaller part thus each Mapper 
> > can deal with piece of the csv file. However, as far as I know, 
> > FileInputFormat only cares about byte size of file, that is, the 
> > class can divide the csv file as many part, and maybe some part is
not a well-format CVS file.
> > For example, one line of the CSV file is not terminated with CRLF, 
> > or maybe some text is trimed.
> >
> > How to ensure each FileSplit is a smaller valid CSV file using a 
> > proper InputFormat?
> >
> > BR/anderson
> >
>
> If all you care about is the splits occurring at line boundaries, then

> TextInputFormat will work.
>
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapre
> d/TextInputFormat.html
>
> If not I guess you can write your own InputFormat class.
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>



--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Multiple NIC Cards

2009-06-10 Thread Matt Massie
If you look at the documentation for the getCanonicalHostName()  
function (thanks, Steve)...


http://java.sun.com/javase/6/docs/api/java/net/InetAddress.html#getCanonicalHostName()

you'll see two Java security properties (networkaddress.cache.ttl and  
networkaddress.cache.negative.ttl).


You might take a look at your /etc/nsswitch.conf configuration as well  
to learn how hosts are resolved on your machine, e.g...


$ grep hosts /etc/nsswitch.conf
hosts:  files dns

and lastly, you may want to check if you are running nscd (the  
NameService cache daemon).  If you are, take a look at /etc/nscd.conf  
for the caching policy it's using.


Good luck.

-Matt



On Jun 10, 2009, at 1:09 PM, John Martyniak wrote:

That is what I thought also, is that it needs to keep that  
information somewhere, because it needs to be able to communicate  
with all of the servers.


So I deleted the /tmp/had* and /tmp/hs* directories, removed the log  
files, and grepped for the duey name in all files in config.  And  
the problem still exists.  Originally I thought that it might have  
had something to do with multiple entries in the .ssh/ 
authorized_keys file but removed everything there.  And the problem  
still existed.


So I think that I am going to grab a new install of hadoop 0.19.1,  
delete the existing one and start out fresh to see if that changes  
anything.


Wish me luck:)

-John

On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote:


John Martyniak wrote:
Does hadoop "cache" the server names anywhere?  Because I changed  
to using DNS for name resolution, but when I go to the nodes view,  
it is trying to view with the old name.  And I changed the hadoop- 
site.xml file so that it no longer has any of those values.


in SVN head, we try and get Java to tell us what is going on
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get  
the value, which is cached for life of the process. I don't know of  
anything else, but wouldn't be surprised -the Namenode has to  
remember the machines where stuff was stored.





John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com





Re: Hadoop streaming - No room for reduce task error

2009-06-10 Thread Todd Lipcon
Hey Scott,
It turns out that Alex's answer was mistaken - your error is actually coming
from lack of disk space on the TT that has been assigned the reduce task.
Specifically, there is not enough space in mapred.local.dir. You'll need to
change your mapred.local.dir to point to a partition that has enough space
to contain your reduce output.

As for why this is the case, I hope someone will pipe up. It seems to me
that reduce output can go directly to the target filesystem without using
space on mapred.local.dir.

Thanks
-Todd

On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard  wrote:

> What is mapred.child.ulimit set to?  This configuration options specifics
> how much memory child processes are allowed to have.  You may want to up
> this limit and see what happens.
>
> Let me know if that doesn't get you anywhere.
>
> Alex
>
> On Wed, Jun 10, 2009 at 9:40 AM, Scott  wrote:
>
> > Complete newby map/reduce question here.  I am using hadoop streaming as
> I
> > come from a Perl background, and am trying to prototype/test a process to
> > load/clean-up ad server log lines from multiple input files into one
> large
> > file on the hdfs that can then be used as the source of a hive db table.
> > I have a perl map script that reads an input line from stdin, does the
> > needed cleanup/manipulation, and writes back to stdout.I don't really
> > need a reduce step, as I don't care what order the lines are written in,
> and
> > there is no summary data to produce.  When I run the job with -reducer
> NONE
> > I get valid output, however I get multiple part-x files rather than
> one
> > big file.
> > So I wrote a trivial 'reduce' script that reads from stdin and simply
> > splits the key/value, and writes the value back to stdout.
> >
> > I am executing the code as follows:
> >
> > ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
> > "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer
> > "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input
> /logs/*.log
> > -output test9
> >
> > The code I have works when given a small set of input files.  However, I
> > get the following error when attempting to run the code on a large set of
> > input files:
> >
> > hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09
> 15:43:00,905
> > WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task.
> Node
> > tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920
> > bytes free; but we expect reduce input to take 22138478392
> >
> > I assume this is because the all the map output is being buffered in
> memory
> > prior to running the reduce step?  If so, what can I change to stop the
> > buffering?  I just need the map output to go directly to one large file.
> >
> > Thanks,
> > Scott
> >
> >
>


Re: Hadoop streaming - No room for reduce task error

2009-06-10 Thread Alex Loddengaard
What is mapred.child.ulimit set to?  This configuration option specifies
how much memory child processes are allowed to have.  You may want to up
this limit and see what happens.

Let me know if that doesn't get you anywhere.

Alex

On Wed, Jun 10, 2009 at 9:40 AM, Scott  wrote:

> Complete newby map/reduce question here.  I am using hadoop streaming as I
> come from a Perl background, and am trying to prototype/test a process to
> load/clean-up ad server log lines from multiple input files into one large
> file on the hdfs that can then be used as the source of a hive db table.
> I have a perl map script that reads an input line from stdin, does the
> needed cleanup/manipulation, and writes back to stdout.I don't really
> need a reduce step, as I don't care what order the lines are written in, and
> there is no summary data to produce.  When I run the job with -reducer NONE
> I get valid output, however I get multiple part-x files rather than one
> big file.
> So I wrote a trivial 'reduce' script that reads from stdin and simply
> splits the key/value, and writes the value back to stdout.
>
> I am executing the code as follows:
>
> ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
> "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer
> "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input /logs/*.log
> -output test9
>
> The code I have works when given a small set of input files.  However, I
> get the following error when attempting to run the code on a large set of
> input files:
>
> hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 15:43:00,905
> WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node
> tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920
> bytes free; but we expect reduce input to take 22138478392
>
> I assume this is because the all the map output is being buffered in memory
> prior to running the reduce step?  If so, what can I change to stop the
> buffering?  I just need the map output to go directly to one large file.
>
> Thanks,
> Scott
>
>


Re: maybe a bug in hadoop?

2009-06-10 Thread Michele Catasta
Hi,

> This seems to already be addressed by
> https://issues.apache.org/jira/browse/HADOOP-2366

Thanks for the pointer, I just submitted a trivial patch for it.


Best,
  Michele


[ANNOUNCE] hamake-1.1

2009-06-10 Thread Vadim Zaliva
I would like to announce maintenance release 1.1 of hamake
(http://code.google.com/p/hamake/).

It mostly includes bug fixes, optimizations and code cleanup. There were
minor syntax changes in the hamakefile format, which is now documented here:
http://code.google.com/p/hamake/wiki/HaMakefileSyntax

This release also includes RPMs in addition to the source tarball.

The 'hamake' utility allows you to automate incremental processing of
datasets stored on HDFS using Hadoop tasks written in Java or using
PigLatin scripts. Datasets can be either individual files or directories
containing groups of files. New files may be added (or removed) at an
arbitrary location, which may trigger recalculation of the data depending
on them. It is similar to the unix 'make' utility.

Sincerely,
Vadim Zaliva


Re: Indexing on top of Hadoop

2009-06-10 Thread Stefan Groschupf

Hi,
you might find some code in katta.sourceforge.net very helpful.
Stefan

~~~
Hadoop training and consulting
http://www.scaleunlimited.com
http://www.101tec.com



On Jun 10, 2009, at 5:49 AM, kartik saxena wrote:


Hi,

I have a huge  LDIF file in order of GBs spanning some million user  
records.
I am running the example "Grep" job on that file. The search results  
have

not really been
upto expectations because of it being a basic per line , brute force.

I was thinking of building some indexes inside HDFS for that file ,  
so that
the search results could improve. What could I possibly try to  
achieve this?



Secura




Re: Multiple NIC Cards

2009-06-10 Thread John Martyniak
That is what I thought also: it needs to keep that information
somewhere, because it needs to be able to communicate with all of the
servers.


So I deleted the /tmp/had* and /tmp/hs* directories, removed the log  
files, and grepped for the duey name in all files in config.  And the  
problem still exists.  Originally I thought that it might have had  
something to do with multiple entries in the .ssh/authorized_keys file  
but removed everything there.  And the problem still existed.


So I think that I am going to grab a new install of hadoop 0.19.1,  
delete the existing one and start out fresh to see if that changes  
anything.


Wish me luck:)

-John

On Jun 10, 2009, at 12:30 PM, Steve Loughran wrote:


John Martyniak wrote:
Does hadoop "cache" the server names anywhere?  Because I changed  
to using DNS for name resolution, but when I go to the nodes view,  
it is trying to view with the old name.  And I changed the hadoop- 
site.xml file so that it no longer has any of those values.


in SVN head, we try and get Java to tell us what is going on
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get  
the value, which is cached for life of the process. I don't know of  
anything else, but wouldn't be surprised -the Namenode has to  
remember the machines where stuff was stored.





John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com



Re: Hadoop benchmarking

2009-06-10 Thread Owen O'Malley

Take a look at Arun's slide deck on Hadoop performance:

http://bit.ly/EDCg3

It is important to make io.sort.mb large enough, and io.sort.factor
should be closer to 100 than the default of 10. I'd also use large block
sizes to reduce the number of maps. Please see the deck for other
important factors.


-- Owen


Where in the WebUI do we see setStatus and stderr output?

2009-06-10 Thread Roshan James
Hi,

I am new to Hadoop and am using Pipes with Hadoop 0.20.0. Can someone
tell me where in the web UI we see status messages set by
TaskContext::setStatus, and the stderr? Also, is stdout captured somewhere?

Thanks in advance,
Roshan


Re: which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Roldano Cattoni
Thanks again, Stuart.

I definitely need to search better ...

Best

  Roldano


On Wed, Jun 10, 2009 at 07:06:58PM +0200, Stuart White wrote:
> http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html#Required+Software
> 
> 
> On Wed, Jun 10, 2009 at 12:02 PM, Roldano Cattoni  wrote:
> 
> > It works, many thanks.
> >
> > Last question: is this information documented somewhere in the package? I
> > was not able to find it.
> >
> >
> >  Roldano
> >
> >
> >
> > On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote:
> > > Java 1.6.
> > >
> > > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni 
> > wrote:
> > >
> > > > A very basic question: which Java version is required for
> > hadoop-0.19.1?
> > > >
> > > > With jre1.5.0_06 I get the error:
> > > >  java.lang.UnsupportedClassVersionError: Bad version number in .class
> > file
> > > >  at java.lang.ClassLoader.defineClass1(Native Method)
> > > >  (..)
> > > >
> > > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06
> > > >
> > > >
> > > > Thanks in advance for your kind help
> > > >
> > > >  Roldano
> > > >
> >


Re: which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Stuart White
http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html#Required+Software


On Wed, Jun 10, 2009 at 12:02 PM, Roldano Cattoni  wrote:

> It works, many thanks.
>
> Last question: is this information documented somewhere in the package? I
> was not able to find it.
>
>
>  Roldano
>
>
>
> On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote:
> > Java 1.6.
> >
> > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni 
> wrote:
> >
> > > A very basic question: which Java version is required for
> hadoop-0.19.1?
> > >
> > > With jre1.5.0_06 I get the error:
> > >  java.lang.UnsupportedClassVersionError: Bad version number in .class
> file
> > >  at java.lang.ClassLoader.defineClass1(Native Method)
> > >  (..)
> > >
> > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06
> > >
> > >
> > > Thanks in advance for your kind help
> > >
> > >  Roldano
> > >
>


Re: which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Roldano Cattoni
It works, many thanks.

Last question: is this information documented somewhere in the package? I
was not able to find it.


  Roldano



On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote:
> Java 1.6.
> 
> On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni  wrote:
> 
> > A very basic question: which Java version is required for hadoop-0.19.1?
> >
> > With jre1.5.0_06 I get the error:
> >  java.lang.UnsupportedClassVersionError: Bad version number in .class file
> >  at java.lang.ClassLoader.defineClass1(Native Method)
> >  (..)
> >
> > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06
> >
> >
> > Thanks in advance for your kind help
> >
> >  Roldano
> >


Re: Indexing on top of Hadoop

2009-06-10 Thread Tarandeep Singh
We have built basic index support in CloudBase (a data warehouse on top of
Hadoop- http://cloudbase.sourceforge.net/) and can share our experience
here-

The index we built is like a Hash Index- for a given column/field value, it
tries to process only those data blocks which contain that value while
ignoring rest of the blocks. As you would know, Hadoop stores data in the
form of data blocks (64MB default size). During index creation stage ( a Map
reduce job), we store the distinct column/field values along with their
block information (node, filename, offset etc) in Hadoop's MapFile. A Map
file is a SequenceFile (stores key,value pairs) which is sorted on keys. So
you can see, it is like building an inverted index, where keys are the
column/field values and posting lists are the lists containing blocks
information. A Map file is not a very efficient persistent map, so you can
store this inverted index in local file system using something like Lucene
Index, but as your index grows you are consuming local space of Master node.
Hence we decided to store it in HDFS using Map file.

When you query using the column/field that was indexed, first this inverted
index (MapFile) is consulted to find all the blocks that contain the desired
value and InputSplits are created. As you would know a Mapper works on Input
Split and in normal cases, Hadoop will schedule jobs that will work on all
input splits but in this case, we have written our own InputFormat that will
schedule jobs only on required input splits.

We have measured the performance of this approach, and in some cases we got a 97%
improvement (we indexed on a date field and ran queries over a year's log files
but fetched only one or a few particular day(s) of data).

Now some caveats-

1) If the column/field that you are indexing contains a large number of
distinct values, then your inverted index (the MapFile) is going to bloat up,
and since this file is looked up before any Map Reduce job is started, this
code runs on the master node. This can become very slow. For example,
if you have query logs and you index the query column/field, then the size of
the MapFile can become really huge. Using a Lucene index on the local file
system can speed up the lookup.

2) It does not work on range queries like column < 10 etc. To solve this, we
also store the min and max value of the column/field found in each block. That
is, for each block the min and max value of the column/field is stored, and when
we encounter a range query this list is scanned to eliminate some blocks.

Another indexing approach we have thought of:

To solve problem 1), where the size of the inverted index (MapFile) becomes huge,
we can use an approach called BloomIndexing. This approach makes use of
BloomFilters (space-efficient probabilistic data structures that are used to
test whether an element is a member of a set or not). For each data block, a
bloom filter is constructed. For 64MB, 128MB, 256MB data block sizes, the
bloom filter will be extremely small. During the index creation stage (a Map job,
no reducer), a mapper reads the block/input split, creates a bloom filter
for the column/field on which you want to create your index, and stores it in
HDFS.

During the query phase, before processing an input split/data block, the
corresponding bloom filter is first consulted to see if the data block
contains the desired value or not. Since a bloom filter never gives a false
negative, if the bloom test fails you can safely skip processing that block.
This saves you the time you would otherwise have spent processing each
row/line in the data block.

Advantages of this approach: compared to HashIndexing, this approach
scatters the block-selection logic across the Map Reduce job, so the master
node is not overloaded scanning a huge inverted index.

Disadvantages of this approach: you still have to schedule as many mappers
as the number of input splits/data blocks, and starting JVMs incurs overhead;
however, since hadoop-0.19 you can use the "reuse jvm" flag to avoid some of
that overhead. Further, you can increase your block size to 128 or 256MB, which
will give you a considerable performance improvement.

Hope this helps,
Tarandeep

On Wed, Jun 10, 2009 at 5:49 AM, kartik saxena  wrote:

> Hi,
>
> I have a huge  LDIF file in order of GBs spanning some million user
> records.
> I am running the example "Grep" job on that file. The search results have
> not really been
> upto expectations because of it being a basic per line , brute force.
>
> I was thinking of building some indexes inside HDFS for that file , so that
> the search results could improve. What could I possibly try to achieve
> this?
>
>
> Secura
>


Hadoop streaming - No room for reduce task error

2009-06-10 Thread Scott
Complete newbie map/reduce question here.  I am using hadoop streaming as 
I come from a Perl background, and am trying to prototype/test a process 
to load/clean-up ad server log lines from multiple input files into one 
large file on the hdfs that can then be used as the source of a hive db 
table. 

I have a perl map script that reads an input line from stdin, does the 
needed cleanup/manipulation, and writes back to stdout.I don't 
really need a reduce step, as I don't care what order the lines are 
written in, and there is no summary data to produce.  When I run the job 
with -reducer NONE I get valid output, however I get multiple part-x 
files rather than one big file. 

So I wrote a trivial 'reduce' script that reads from stdin and simply 
splits the key/value, and writes the value back to stdout.


I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper 
"/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer 
"/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input 
/logs/*.log -output test9


The code I have works when given a small set of input files.  However, I 
get the following error when attempting to run the code on a large set 
of input files:


hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 
15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for 
reduce task. Node 
tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 
bytes free; but we expect reduce input to take 22138478392


I assume this is because all the map output is being buffered in 
memory prior to running the reduce step?  If so, what can I change to 
stop the buffering?  I just need the map output to go directly to one 
large file.


Thanks,
Scott



Re: which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Stuart White
Java 1.6.

On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni  wrote:

> A very basic question: which Java version is required for hadoop-0.19.1?
>
> With jre1.5.0_06 I get the error:
>  java.lang.UnsupportedClassVersionError: Bad version number in .class file
>  at java.lang.ClassLoader.defineClass1(Native Method)
>  (..)
>
> By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06
>
>
> Thanks in advance for your kind help
>
>  Roldano
>


which Java version for hadoop-0.19.1 ?

2009-06-10 Thread Roldano Cattoni
A very basic question: which Java version is required for hadoop-0.19.1?

With jre1.5.0_06 I get the error:
  java.lang.UnsupportedClassVersionError: Bad version number in .class file
  at java.lang.ClassLoader.defineClass1(Native Method)
  (..)

By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 


Thanks in advance for your kind help

  Roldano


Re: Multiple NIC Cards

2009-06-10 Thread Steve Loughran

John Martyniak wrote:
Does hadoop "cache" the server names anywhere?  Because I changed to 
using DNS for name resolution, but when I go to the nodes view, it is 
trying to view with the old name.  And I changed the hadoop-site.xml 
file so that it no longer has any of those values.




in SVN head, we try and get Java to tell us what is going on
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get the 
value, which is cached for the life of the process. I don't know of anything 
else, but wouldn't be surprised - the Namenode has to remember the 
machines where stuff was stored.





Re: Hadoop benchmarking

2009-06-10 Thread Aaron Kimball
Hi Stephen,

That will set the maximum heap allowable, but doesn't tell Hadoop's internal
systems necessarily to take advantage of it. There's a number of other
settings that adjust performance. At Cloudera we have a config tool that
generates Hadoop configurations with reasonable first-approximation values
for your cluster -- check out http://my.cloudera.com and look at the
hadoop-site.xml it generates. If you start from there you might find a
better parameter space to explore. Please share back your findings -- we'd
love to tweak the tool even more with some external feedback :)

- Aaron


On Wed, Jun 10, 2009 at 7:39 AM, stephen mulcahy
wrote:

> Hi,
>
> I'm currently doing some testing of different configurations using the
> Hadoop Sort as follows,
>
> bin/hadoop jar hadoop-*-examples.jar randomwriter
> -Dtest.randomwrite.total_bytes=107374182400 /benchmark100
>
> bin/hadoop jar hadoop-*-examples.jar sort /benchmark100 rand-sort
>
> The only changes I've made from the standard config are the following in
> conf/mapred-site.xml
>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx1024M</value>
>   </property>
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>8</value>
>   </property>
>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>4</value>
>   </property>
>
> I'm running this on 4 systems, each with 8 processor cores and 4 separate
> disks.
>
> Is there anything else I should change to stress memory more? The systems
> in questions have 16GB of memory but the most thats getting used during a
> run of this benchmark is about 2GB (and most of that seems to be os
> caching).
>
> Thanks,
>
> -stephen
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.iehttp://webstar.deri.iehttp://sindice.com
>


Re: Help - ClassNotFoundException from tasktrackers

2009-06-10 Thread Aaron Kimball
mapred.local.dir can definitely work with multiple paths, but it's a bit
strict about the format. You should have one or more paths, separated by
commas and no whitespace. e.g. "/disk1/mapred,/disk2/mapred". If you've got
them on new lines, etc, then it might try to interpret that as part of one
of the paths. Could this be your issue? If you paste your hadoop-site.xml in
to an email we could take a look.
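A quick sanity check on what the tasktrackers actually picked up (the file name is
whatever your nodes use - hadoop-site.xml here):

$ grep -A1 mapred.local.dir conf/hadoop-site.xml
# the value should be a single line such as /disk1/mapred,/disk2/mapred - commas only,
# no spaces or newlines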

- Aaron

On Wed, Jun 10, 2009 at 2:41 AM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:

> Hi,
>
> After some amount of debugging, I narrowed down the problem. I'd set
> multiple paths (multiple disks mounted at different endpoints) for the *
> mapred.local.dir* setting. When I changed this to just one path, everything
> started working - previously the job jar and the job xml weren't being
> created under the local/taskTracker/jobCache folder.
>
> Any idea why? Am I doing something wrong? I've read somewhere else that
> specifying multiple disks for mapred.local.dir is important to increase
> disk
> bandwidth so that the map outputs get written faster to the local disks on
> the tasktracker nodes.
>
> Cheers,
> Harish
>
> On Tue, Jun 9, 2009 at 5:21 PM, Harish Mallipeddi <
> harish.mallipe...@gmail.com> wrote:
>
> > I setup a new cluster (1 namenode + 2 datanodes). I'm trying to run the
> > GenericMRLoadGenerator program from hadoop-0.20.0-test.jar. But I keep
> > getting the ClassNotFoundException. Any reason why this would be
> happening?
> > It seems to me like the tasktrackers cannot find the class files from the
> > hadoop program but when you do "hadoop jar", it will automatically ship
> the
> > job jar file to all the tasktracker nodes and put them in the classpath?
> >
> > $ bin/hadoop jar hadoop-0.20.0-test.jar
> loadgen -Dtest.randomtextwrite.bytes_per_map=1048576
> > -Dhadoop.sort.reduce.keep.percent=50.0 -outKey org.apache.hadoop.io.Text
> > -outValue org.apache.hadoop.io.Text
> >
> > No input path; ignoring InputFormat
> > Job started: Tue Jun 09 03:16:30 PDT 2009
> > 09/06/09 03:16:30 INFO mapred.JobClient: Running job:
> job_200906090316_0001
> > 09/06/09 03:16:31 INFO mapred.JobClient:  map 0% reduce 0%
> > 09/06/09 03:16:40 INFO mapred.JobClient: Task Id :
> > attempt_200906090316_0001_m_00_0, Status : FAILED
> > java.io.IOException: Split class
> >
> org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
> > not found
> > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:324)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > Caused by: java.lang.ClassNotFoundException:
> >
> org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
> > at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> > at java.lang.Class.forName0(Native Method)
> > at java.lang.Class.forName(Class.java:247)
> > at
> >
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:761)
> > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:321)
> > ... 2 more
> > [more of the same exception from different map tasks...I've removed them]
> >
> > Actually the same thing happens even if I run the PiEstimator program
> from
> > hadoop-0.20.0-examples.jar.
> >
> > Thanks,
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
> >
>
>
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>


Re: Multiple NIC Cards

2009-06-10 Thread John Martyniak
Does hadoop "cache" the server names anywhere?  Because I changed to  
using DNS for name resolution, but when I go to the nodes view, it is  
trying to view with the old name.  And I changed the hadoop-site.xml  
file so that it no longer has any of those values.


Any help would be appreciated.

Thank you,

-John

On Jun 9, 2009, at 9:24 PM, John Martyniak wrote:



So I setup a dns server that is for the internal network.  changed  
all of the names to duey.local, and created a master zone for .local  
on the DNS.  Put the domains server as the first one in /etc/ 
resolv.conf file, added it to the interface.  I changed the hostname  
of the machine that it is running on from duey..com to  
duey.local.  Checked that the dns resolves, and it does. Ran  
nslookup and returns the name of the machine given the ip address.


changed all of the names from the IP Addresses to duey.local, in my  
hadoop-site.xml, changed the names in the masters and slaves files.


Deleted all of the logs, deleted the /tmp directory stuff.

Then restarted hadoop.  And much to my surprise.it still didn't  
work.


I really thought that this would work as it seems to be the  
consensus that the issue is the resolution of the name.


Any other thoughts would be greatly appreciated.

-John





On Jun 9, 2009, at 3:17 PM, Raghu Angadi wrote:



I still need to go through the whole thread. but we feel your pain.

First, please try setting fs.default.name to namenode internal ip  
on the datanodes. This should make NN to attach internal ip so the  
datanodes (assuming your routing is correct). NameNode webUI should  
list internal ips for datanode. You might have to temporarily  
change NameNode code to listen on 0.0.0.0.


That said, The issues you are facing are pretty unfortunate. As  
Steve mentioned Hadoop is all confused about hostname/ip and there  
is unecessary reliance on hostname and reverse DNS look ups in many  
many places.


At least fairly straight fwd set ups with multiple NICs should be  
handled well.


dfs.datanode.dns.interface should work like you expected (but not  
very surprised it didn't).


Another thing you could try is setting dfs.datanode.address to the  
internal ip address (this might already be discussed in the  
thread). This should at least get all the bulk datatransfers happen  
over internal NICs. One way to make sure is to hover on the  
datanode node on NameNode webUI.. it shows the ip address.


good luck.

It might be better document your pains and findings in a Jira (with  
most of the details in one or more comments rather than in  
description).


Raghu.

John Martyniak wrote:
So I changed all of the 0.0.0.0 on one machine to point to the  
192.168.1.102 address.
And still it picks up the hostname and ip address of the external  
network.
I am kind of at my wits end with this, as I am not seeing a  
solution yet, except to take the machines off of the external  
network and leave them on the internal network which isn't an  
option.

Has anybody had this problem before?  What was the solution?
-John
On Jun 9, 2009, at 10:17 AM, Steve Loughran wrote:
One thing to consider is that some of the various services of  
Hadoop are bound to 0:0:0:0, which means every Ipv4 address, you  
really want to bring up everything, including jetty services, on  
the en0 network adapter, by binding them to  192.168.1.102; this  
will cause anyone trying to talk to them over the other network  
to fail, which at least find the problem sooner rather than later






John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com



Re: speedy Google Maps driving directions like implementation

2009-06-10 Thread Joerg Rieger

Hello,

On 10.06.2009, at 11:43, Lukáš Vlček wrote:


Hi,
I am wondering how Google implemented the driving directions  
function in the
Maps. More specifically how did they do it that it is so fast. I  
asked on
Google engineer about this and all he told me is just that there are  
bunch
of MapReduce cycles involved in this process but I don't think it is  
really
close to the truth. How it is possible to implement such real-time  
function

in plain MapReduce fashion (from the Hadoop point of view)? The only
possibility I can think of right now it that they either execute the
MapReduce computation only in the memory (intermediate results as  
well as
Reduce to next Map results are kept only in some-kind of distributed  
memory)

or they use other architecture for this.

In simple words I know there are some tutorials about how to nail  
down SSSP
problem with MapReduce but I can not believe it can produce results  
in such

a quick response I can experience with Google Maps.

Any comments, ideas?



I don't think they use MapReduce for the user-facing part of Google  
Maps.


If I had to implement something similar, I would use MapReduce jobs to
perform as much "offline" processing as possible.

Therefore my guess is that MapReduce jobs are used to create the static map
tiles from the huge satellite imagery.

I also think that MapReduce jobs create an index or a graph representation
of roads and their intersections with other roads, including intersections
with towns and other points of interest.

The resulting graph is probably stored in a Memcache type memory system
for very fast retrieval.

But then again, I'm not a GIS expert (not even close). ;-)


J

--





Re: maybe a bug in hadoop?

2009-06-10 Thread stephen mulcahy

Hi,

This seems to already be addressed by 
https://issues.apache.org/jira/browse/HADOOP-2366


-stephen

Brian Bockelman wrote:

Hey Stephen,

I've hit this "bug" before (rather, our admins did...).

I would be happy to see you file it  - after checking for duplicates - 
so I no longer have to warn people about it.


Brian

On Jun 10, 2009, at 6:29 AM, stephen mulcahy wrote:


Tim Wintle wrote:

I thought I'd double check...
$touch \ file\ with\ leading\ space
$touch file\ without\ leading\ space
$ ls
file with leading space
file without leading space


Oh sure, Linux will certainly allow you to create files/dirs with such 
names (and better .. hows about


touch \.\.\.

or similar (which used to be a favourite of guys who had managed to 
exploit bad permissions on ftp sites to upload material of dubious 
copyright). But is that the intention in the Hadoop config or is it a 
usability oversight? Hence my mail here rather than raising a bug 
immediately.



... so stripping it out could mean you couldn't enter some valid
directory names (but who has folders starting with a space?)


Well ... I do now ;)

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com





--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Hadoop benchmarking

2009-06-10 Thread stephen mulcahy

Hi,

I'm currently doing some testing of different configurations using the 
Hadoop Sort as follows,


bin/hadoop jar hadoop-*-examples.jar randomwriter 
-Dtest.randomwrite.total_bytes=107374182400 /benchmark100


bin/hadoop jar hadoop-*-examples.jar sort /benchmark100 rand-sort

The only changes I've made from the standard config are the following in
conf/mapred-site.xml:

   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx1024M</value>
   </property>

   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>8</value>
   </property>

   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>4</value>
   </property>

I'm running this on 4 systems, each with 8 processor cores and 4 
separate disks.


Is there anything else I should change to stress memory more? The
systems in question have 16GB of memory, but the most that's getting
used during a run of this benchmark is about 2GB (and most of that
seems to be OS caching).


Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: HDFS issues..!

2009-06-10 Thread Todd Lipcon
On Wed, Jun 10, 2009 at 4:55 AM, Sugandha Naolekar
wrote:

> If I want to make the data transfer fast, then what am I supposed
> to do? I want to place the data in HDFS and replicate it in a fraction
> of a second.


I want to go to France, but it takes 10+ hours to get there from California
on the fastest plane. How can I get there faster?


> Can that be possible, and how? Placing a 5GB file will take at least
> half an hour or so, but if it's a large cluster, let's say of 7 nodes,
> then placing it in HDFS would take around 2-3 hours. So how can that
> time delay be avoided?
>

HDFS will only replicate as many times as you want it to. The write is also
pipelined. This means that writing a 5G file that is replicated to 3 nodes
is only marginally faster than the same file on 10 nodes, if for some reason
you wanted to set your replication count to 10 (unnecessary for 99.9% of
use cases).


>
> Also, My simply aim is to transfer the data, i.e; dumping the data
> into HDFS and gettign it back whenever needed. So, for this, transfer, how
> speed can be achieved?


HDFS isn't magic. You can only write as fast as your disk and network can.
If your disk has 50MB/sec of throughput, you'll probably be limited at
50MB/sec. Expecting much more than this in real life scenarios is
unrealistic.
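As a rough back-of-the-envelope check, using the 50MB/sec figure above: a
5GB file is roughly 5120MB, and 5120MB / 50MB/sec is about 100 seconds,
before any replication or network overhead is counted. So "a fraction of a
second" is simply not a realistic target for data of that size, no matter
how the cluster is tuned.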

-Todd


Re: HDFS data transfer!

2009-06-10 Thread Brian Bockelman

Hey Sugandha,

Transfer rates depend on the quality/quantity of your hardware and the  
quality of your client disk that is generating the data.  I usually  
say that you should expect near-hardware-bottleneck speeds for an  
otherwise idle cluster.


There should be no "make it fast" required (though you should review
the logs for errors if it's going slow).  I would expect a 5GB file to  
take around 3-5 minutes to write on our cluster, but it's a well-tuned  
and operational cluster.


As Todd (I think) mentioned before, we can't help any when you say "I  
want to make it faster".  You need to provide diagnostic information -  
logs, Ganglia plots, stack traces, something - that folks can look at.


Brian

On Jun 10, 2009, at 2:25 AM, Sugandha Naolekar wrote:

But if I want to make it fast, then what? I want to place the data in
HDFS and replicate it in a fraction of a second. Can that be possible,
and how?

On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena  
 wrote:


I would suppose about 2-3 hours. It took me some 2 days to load a  
160 Gb

file.
Secura

On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar
wrote:


Hello!

If I try to transfer a 5GB VDI file from a remote host (not a part of
the hadoop cluster) into HDFS, and get it back, how much time is it
supposed to take?

No map-reduce involved. Simply writing files in and out of HDFS
through simple Java code (using the API).

--
Regards!
Sugandha







--
Regards!
Sugandha




Re: maybe a bug in hadoop?

2009-06-10 Thread Brian Bockelman

Hey Stephen,

I've hit this "bug" before (rather, our admins did...).

I would be happy to see you file it  - after checking for duplicates -  
so I no longer have to warn people about it.


Brian

On Jun 10, 2009, at 6:29 AM, stephen mulcahy wrote:


Tim Wintle wrote:

I thought I'd double check...
$touch \ file\ with\ leading\ space
$touch file\ without\ leading\ space
$ ls
file with leading space
file without leading space


Oh sure, Linux will certainly allow you to create files/dirs with such
names (and better still, how about

touch \.\.\.

or similar, which used to be a favourite of people who had managed to
exploit bad permissions on FTP sites to upload material of dubious
copyright). But is that the intention in the Hadoop config, or is it a
usability oversight? Hence my mail here rather than raising a bug
immediately.



... so stripping it out could mean you couldn't enter some valid
directory names (but who has folders starting with a space?)


Well ... I do now ;)

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com




Re: maybe a bug in hadoop?

2009-06-10 Thread stephen mulcahy

Tim Wintle wrote:

I thought I'd double check...

$touch \ file\ with\ leading\ space
$touch file\ without\ leading\ space
$ ls
 file with leading space
file without leading space


Oh sure, Linux will certainly allow you to create files/dirs with such
names (and better still, how about

touch \.\.\.

or similar, which used to be a favourite of people who had managed to
exploit bad permissions on FTP sites to upload material of dubious
copyright). But is that the intention in the Hadoop config, or is it a
usability oversight? Hence my mail here rather than raising a bug
immediately.



... so stripping it out could mean you couldn't enter some valid
directory names (but who has folders starting with a space?)


Well ... I do now ;)

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: maybe a bug in hadoop?

2009-06-10 Thread Tim Wintle
On Wed, 2009-06-10 at 13:56 +0100, stephen mulcahy wrote:
> Hi,
> 
> I've been running some tests with hadoop in pseudo-distributed mode. My 
> config includes the following in conf/hdfs-site.xml
> 
> <property>
>   <name>dfs.data.dir</name>
>   <value>/hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4</value>
> </property>
> 
> When I started running hadoop I was disappointed to see that only 
> /hdfs/disk1 was created. It was only later I noticed that HADOOP_HOME 
> now contained the following new directories
> 
> HADOOP_HOME/ /hdfs/disk2
> HADOOP_HOME/ /hdfs/disk3
> HADOOP_HOME/ /hdfs/disk4
> 
> I guess any leading spaces should be stripped out of the data dir names? 
> Or maybe there is a reason for this behaviour. I thought I should 
> mention it just in case.

I thought I'd double check...

$touch \ file\ with\ leading\ space
$touch file\ without\ leading\ space
$ ls
 file with leading space
file without leading space

... so stripping it out could mean you couldn't enter some valid
directory names (but who has folders starting with a space?)

I'm sure my input wasn't very useful, but just a comment.

Tim Wintle




maybe a bug in hadoop?

2009-06-10 Thread stephen mulcahy

Hi,

I've been running some tests with hadoop in pseudo-distributed mode. My
config includes the following in conf/hdfs-site.xml:

   <property>
     <name>dfs.data.dir</name>
     <value>/hdfs/disk1, /hdfs/disk2, /hdfs/disk3, /hdfs/disk4</value>
   </property>

When I started running hadoop I was disappointed to see that only 
/hdfs/disk1 was created. It was only later I noticed that HADOOP_HOME 
now contained the following new directories


HADOOP_HOME/ /hdfs/disk2
HADOOP_HOME/ /hdfs/disk3
HADOOP_HOME/ /hdfs/disk4

I guess any leading spaces should be stripped out of the data dir names? 
Or maybe there is a reason for this behaviour. I thought I should 
mention it just in case.
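For what it's worth, writing the value without any spaces after the
commas, i.e. /hdfs/disk1,/hdfs/disk2,/hdfs/disk3,/hdfs/disk4, avoids the
behaviour entirely, since there is then nothing to trim.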


Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Indexing on top of Hadoop

2009-06-10 Thread kartik saxena
Hi,

I have a huge LDIF file, on the order of GBs, spanning some millions of
user records. I am running the example "Grep" job on that file. The
search results have not really been up to expectations because it is a
basic per-line, brute-force scan.

I was thinking of building some indexes inside HDFS for that file, so
that the search results could improve. What could I possibly try to
achieve this?
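One thing I could try is a MapReduce job that builds an inverted index
from each term to the byte offsets of the records containing it, so that
a later lookup reads the (much smaller) index and seeks directly into the
big file instead of scanning all of it. A rough sketch with the old
org.apache.hadoop.mapred API follows; the class name, the whitespace
tokenisation, and the choice to index every token are just placeholder
assumptions that would need adapting to the LDIF layout.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Builds term -> comma-separated byte offsets over a large line-oriented file.
public class OffsetIndexer {

  public static class IndexMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      // TextInputFormat (the default) supplies the byte offset of each line as the key.
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        out.collect(new Text(tok.nextToken().toLowerCase()),
                    new LongWritable(offset.get()));
      }
    }
  }

  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, Text> {
    public void reduce(Text term, Iterator<LongWritable> offsets,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // For very common terms you would cap or bucket this posting list.
      StringBuilder sb = new StringBuilder();
      while (offsets.hasNext()) {
        if (sb.length() > 0) sb.append(',');
        sb.append(offsets.next().get());
      }
      out.collect(term, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(OffsetIndexer.class);
    conf.setJobName("offset-indexer");
    conf.setMapperClass(IndexMapper.class);
    conf.setReducerClass(IndexReducer.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(LongWritable.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

A lookup would then scan only the index output for the term and use
FSDataInputStream.seek() to jump to the listed offsets in the original
file. For anything more sophisticated, a Lucene-style index is probably a
better fit than rolling your own.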


Secura


Re: Large size Text file split

2009-06-10 Thread jason hadoop
There is always NLineInputFormat. You specify the number of lines per split.
The key is the position of the line start in the file, and the value is
the line itself. The parameter mapred.line.input.format.linespermap
controls the number of lines per split.
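A minimal old-API sketch (the class name and paths are placeholders;
NLineInputFormat lives in org.apache.hadoop.mapred.lib in this era, and
the identity mapper/reducer stand in for your real ones):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// Each map task receives a fixed number of complete lines from the big file.
public class NLineCsvJob {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(NLineCsvJob.class);
    conf.setJobName("nline-csv");
    conf.setInputFormat(NLineInputFormat.class);
    // lines handed to each map task, i.e. lines per split
    conf.setInt("mapred.line.input.format.linespermap", 100000);
    conf.setMapperClass(IdentityMapper.class);    // swap in your real mapper
    conf.setReducerClass(IdentityReducer.class);  // or conf.setNumReduceTasks(0)
    conf.setOutputKeyClass(LongWritable.class);   // NLineInputFormat keys are offsets
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}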

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:

> On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo 
> wrote:
>
> > Hi, all
> >
> > I have a large CSV file (larger than 10 GB), and I'd like to use a certain
> > InputFormat to split it into smaller parts so that each Mapper can deal
> > with a piece of the CSV file. However, as far as I know, FileInputFormat
> > only cares about the byte size of the file; that is, the class can divide
> > the CSV file into many parts, and maybe some part is not a well-formed
> > CSV file. For example, one line of the CSV file might not be terminated
> > with CRLF, or maybe some text is trimmed.
> >
> > How do I ensure each FileSplit is a smaller valid CSV file using a proper
> > InputFormat?
> >
> > BR/anderson
> >
>
> If all you care about is the splits occurring at line boundaries, then
> TextInputFormat will work.
>
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
>
> If not I guess you can write your own InputFormat class.
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Large size Text file split

2009-06-10 Thread Harish Mallipeddi
On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo  wrote:

> Hi, all
>
> I have a large CSV file (larger than 10 GB), and I'd like to use a certain
> InputFormat to split it into smaller parts so that each Mapper can deal
> with a piece of the CSV file. However, as far as I know, FileInputFormat
> only cares about the byte size of the file; that is, the class can divide
> the CSV file into many parts, and maybe some part is not a well-formed
> CSV file. For example, one line of the CSV file might not be terminated
> with CRLF, or maybe some text is trimmed.
>
> How do I ensure each FileSplit is a smaller valid CSV file using a proper
> InputFormat?
>
> BR/anderson
>

If all you care about is the splits occurring at line boundaries, then
TextInputFormat will work.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

If not I guess you can write your own InputFormat class.
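(The reason TextInputFormat is safe here despite its byte-oriented splits:
each split's record reader skips the first, possibly partial, line unless
the split starts at offset zero, and reads past the end of its split to
finish the last line, so every mapper only ever sees whole lines.)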

-- 
Harish Mallipeddi
http://blog.poundbang.in


Large size Text file split

2009-06-10 Thread Wenrui Guo
Hi, all

I have a large CSV file (larger than 10 GB), and I'd like to use a certain
InputFormat to split it into smaller parts so that each Mapper can deal
with a piece of the CSV file. However, as far as I know, FileInputFormat
only cares about the byte size of the file; that is, the class can divide
the CSV file into many parts, and maybe some part is not a well-formed
CSV file. For example, one line of the CSV file might not be terminated
with CRLF, or maybe some text is trimmed.

How do I ensure each FileSplit is a smaller valid CSV file using a proper
InputFormat?

BR/anderson 


HDFS issues..!

2009-06-10 Thread Sugandha Naolekar
 If I want to make the data transfer fast, then what am I supposed
to do? I want to place the data in HDFS and replicate it in a fraction of
a second. Can that be possible, and how? Placing a 5GB file will take at
least half an hour or so, but if it's a large cluster, let's say of 7 nodes,
then placing it in HDFS would take around 2-3 hours. So how can that time
delay be avoided?

 Also, my aim is simply to transfer the data, i.e. dumping the data
into HDFS and getting it back whenever needed. So what transfer speed can
be achieved?
-- 
Regards!
Sugandha


speedy Google Maps driving directions like implementation

2009-06-10 Thread Lukáš Vlček
Hi,
I am wondering how Google implemented the driving directions function in
Maps, and more specifically how they made it so fast. I asked one Google
engineer about this and all he told me is that there are a bunch of
MapReduce cycles involved in the process, but I don't think that is really
the whole story. How would it be possible to implement such a real-time
function in plain MapReduce fashion (from the Hadoop point of view)? The
only possibility I can think of right now is that they either execute the
MapReduce computation entirely in memory (intermediate results, as well as
the results passed from one Reduce to the next Map, are kept only in some
kind of distributed memory) or they use a different architecture for this.

In simple words: I know there are tutorials about how to tackle the SSSP
problem with MapReduce, but I cannot believe it could produce results with
the kind of response time I experience with Google Maps.

Any comments, ideas?

Thanks,
Lukas


Re: Help - ClassNotFoundException from tasktrackers

2009-06-10 Thread Harish Mallipeddi
Hi,

After some amount of debugging, I narrowed down the problem. I'd set
multiple paths (multiple disks mounted at different mount points) for the
mapred.local.dir setting. When I changed this to just one path, everything
started working - previously the job jar and the job xml weren't being
created under the local/taskTracker/jobCache folder.

Any idea why? Am I doing something wrong? I've read somewhere else that
specifying multiple disks for mapred.local.dir is important to increase disk
bandwidth so that the map outputs get written faster to the local disks on
the tasktracker nodes.
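(For reference, the multi-disk value is normally just a plain
comma-separated list of local directories on the TaskTracker node, e.g.
/disk1/mapred/local,/disk2/mapred/local, written without spaces around the
commas; the directory names here are placeholders, and each path needs to
be writable by the user running the TaskTracker.)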

Cheers,
Harish

On Tue, Jun 9, 2009 at 5:21 PM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:

> I setup a new cluster (1 namenode + 2 datanodes). I'm trying to run the
> GenericMRLoadGenerator program from hadoop-0.20.0-test.jar. But I keep
> getting the ClassNotFoundException. Any reason why this would be happening?
> It seems to me like the tasktrackers cannot find the class files from the
> hadoop program but when you do "hadoop jar", it will automatically ship the
> job jar file to all the tasktracker nodes and put them in the classpath?
>
> $ bin/hadoop jar hadoop-0.20.0-test.jar loadgen
> -Dtest.randomtextwrite.bytes_per_map=1048576
> -Dhadoop.sort.reduce.keep.percent=50.0 -outKey org.apache.hadoop.io.Text
> -outValue org.apache.hadoop.io.Text
>
> No input path; ignoring InputFormat
> Job started: Tue Jun 09 03:16:30 PDT 2009
> 09/06/09 03:16:30 INFO mapred.JobClient: Running job: job_200906090316_0001
> 09/06/09 03:16:31 INFO mapred.JobClient:  map 0% reduce 0%
> 09/06/09 03:16:40 INFO mapred.JobClient: Task Id :
> attempt_200906090316_0001_m_00_0, Status : FAILED
> java.io.IOException: Split class
> org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
> not found
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:324)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
> at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
> at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:761)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:321)
> ... 2 more
> [more of the same exception from different map tasks...I've removed them]
>
> Actually the same thing happens even if I run the PiEstimator program from
> hadoop-0.20.0-examples.jar.
>
> Thanks,
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>



-- 
Harish Mallipeddi
http://blog.poundbang.in


Re: HDFS data transfer!

2009-06-10 Thread Sugandha Naolekar
But if I want to make it fast, then what? I want to place the data in HDFS
and replicate it in a fraction of a second. Can that be possible, and how?

On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena  wrote:

> I would suppose about 2-3 hours. It took me some 2 days to load a 160 Gb
> file.
> Secura
>
> On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar
> wrote:
>
> > Hello!
> >
> > If I try to transfer a 5GB VDI file from a remote host (not a part of the
> > hadoop cluster) into HDFS, and get it back, how much time is it supposed
> > to take?
> >
> > No map-reduce involved. Simply writing files in and out of HDFS through
> > simple Java code (using the API).
> >
> > --
> > Regards!
> > Sugandha
> >
>



-- 
Regards!
Sugandha


Re: Can not build Hadoop in CentOS 64bit

2009-06-10 Thread Ian jonhson
It works, thanks



On Wed, Jun 10, 2009 at 2:28 PM, Todd Lipcon wrote:
> Hi Ian,
>
> Recent versions of Hadoop require Java 1.6. You will not be able to
> successfully compile on Java 1.5
>


Re: HDFS data transfer!

2009-06-10 Thread kartik saxena
I would suppose about 2-3 hours. It took me some 2 days to load a 160 Gb
file.
Secura

On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar
wrote:

> Hello!
>
> If I try to transfer a 5GB VDI file from a remote host (not a part of the
> hadoop cluster) into HDFS, and get it back, how much time is it supposed
> to take?
>
> No map-reduce involved. Simply writing files in and out of HDFS through
> simple Java code (using the API).
>
> --
> Regards!
> Sugandha
>