Re: Decommissioning Nodes

2009-01-21 Thread Jeremy Chow
Hey Alyssa,
If one of those datanodes goes down, a few minutes will pass before the master
discovers it. The master marks nodes that have not sent a heartbeat for quite a
while as dead.
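
(For reference, decommissioning on this version is driven by an exclude file plus
a periodic namenode check, which can by itself account for a few minutes of
"Decommission In Progress" even when there are no blocks to move. The usual flow,
with an example exclude-file path, looks roughly like this:)

<!-- hadoop-site.xml on the master: point the namenode at an exclude file -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/conf/dfs.exclude</value>
</property>

# add the datanode's hostname to the exclude file, then ask the namenode to re-read it
echo slave >> /home/hadoop/conf/dfs.exclude
bin/hadoop dfsadmin -refreshNodes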

On Thu, Jan 22, 2009 at 8:34 AM, Hargraves, Alyssa  wrote:

> Hello Hadoop Users,
>
> I was hoping someone would be able to answer a question about node
> decommissioning.  I have a test Hadoop cluster set up which only consists of
> my computer and a master node.  I am looking at the removal and addition of
> nodes.  Adding a node is nearly instant (only about 5 seconds), but removing
> a node by decommissioning it takes a while, and I don't understand why.
> Currently, the systems are running no map/reduce tasks and storing no data.
> DFS Health reports:
>
> 7 files and directories, 0 blocks = 7 total. Heap Size is 6.68 MB / 992.31
> MB (0%)
> Capacity:   298.02 GB
> DFS Remaining   :   245.79 GB
> DFS Used:   4 KB
> DFS Used%   :   0 %
> Live Nodes  :   2
> Dead Nodes  :   0
>
> Node    Last Contact   Admin State                Size (GB)   Used (%)   Remaining (GB)   Blocks
> master  0              In Service                 149.01      0          122.22           0
> slave   82             Decommission In Progress   149.01      0          123.58           0
>
> However, even with nothing stored and nothing running, the decommission
> process takes 3 to 5 minutes, and I'm not quite sure why. There isn't any
> data to move anywhere, and there aren't any jobs to worry about.  I am using
> 0.18.2.
>
> Thank you for any help in solving this,
> Alyssa Hargraves




-- 
My research interests are distributed systems, parallel computing, and
bytecode-based virtual machines.

http://coderplay.javaeye.com


Decommissioning Nodes

2009-01-21 Thread Hargraves, Alyssa
Hello Hadoop Users,

I was hoping someone would be able to answer a question about node 
decommissioning.  I have a test Hadoop cluster set up which only consists of my 
computer and a master node.  I am looking at the removal and addition of nodes. 
 Adding a node is nearly instant (only about 5 seconds), but removing a node by 
decommissioning it takes a while, and I don't understand why. Currently, the 
systems are running no map/reduce tasks and storing no data. DFS Health reports:

7 files and directories, 0 blocks = 7 total. Heap Size is 6.68 MB / 992.31 MB 
(0%)
Capacity:   298.02 GB
DFS Remaining   :   245.79 GB
DFS Used:   4 KB
DFS Used%   :   0 %
Live Nodes  :   2
Dead Nodes  :   0

Node    Last Contact   Admin State                Size (GB)   Used (%)   Remaining (GB)   Blocks
master  0              In Service                 149.01      0          122.22           0
slave   82             Decommission In Progress   149.01      0          123.58           0

However, even with nothing stored and nothing running, the decommission process 
takes 3 to 5 minutes, and I'm not quite sure why. There isn't any data to move 
anywhere, and there aren't any jobs to worry about.  I am using 0.18.2.

Thank you for any help in solving this,
Alyssa Hargraves

Re: using distcp for http source files

2009-01-21 Thread Doug Cutting

Derek Young wrote:
Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like 
this should be supported, but the http URLs are not working for me.  Are 
http source URLs still supported?


No.  They used to be supported, but when distcp was converted to accept 
any Path this stopped working, since there is no FileSystem 
implementation mapped to http: paths.  Implementing an HttpFileSystem 
that supports read-only access to files and no directory listings is 
fairly trivial, but without directory listings, distcp would not work well.


https://issues.apache.org/jira/browse/HADOOP-1563 includes a now 
long-stale patch that implements an HTTP filesystem, where directory 
listings are implemented, assuming that:

  - directories are represented by slash-terminated URLs;
  - a GET of a directory contains the URLs of its children.

This works for the directory listings returned by many HTTP servers.
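
For illustration only (this is not the HADOOP-1563 patch), the listing side under
those two assumptions could be little more than scraping hrefs out of the page a
GET of the directory URL returns:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: derive a "directory listing" from a plain HTTP server, assuming
// slash-terminated directory URLs whose GET body links to the children.
public class HttpDirListing {
  private static final Pattern HREF =
      Pattern.compile("href=\"([^\"?#]+)\"", Pattern.CASE_INSENSITIVE);

  public static List<URL> list(URL dir) throws Exception {
    List<URL> children = new ArrayList<URL>();
    BufferedReader in = new BufferedReader(new InputStreamReader(dir.openStream()));
    StringBuilder body = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
      body.append(line);
    }
    in.close();
    Matcher m = HREF.matcher(body);
    while (m.find()) {
      String href = m.group(1);
      if (href.startsWith("..")) {
        continue;                        // skip parent-directory links
      }
      children.add(new URL(dir, href));  // resolve relative to the directory URL
    }
    return children;
  }

  public static void main(String[] args) throws Exception {
    for (URL child : list(new URL(args[0]))) {
      System.out.println(child);
    }
  }
}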

Perhaps someone can update this patch, and, if folks find it useful, we 
can include it.


Doug


RE: Hadoop User Group Meeting (Bay Area) 1/21

2009-01-21 Thread Ajay Anand
Reminder - the Bay Area Hadoop User Group meeting is today at 6 pm.

 



From: Ajay Anand 
Sent: Thursday, January 08, 2009 12:10 PM
To: 'core-user@hadoop.apache.org'; 'gene...@hadoop.apache.org';
'zookeeper-u...@hadoop.apache.org'; 'hbase-u...@hadoop.apache.org';
pig-u...@hadoop.apache.org
Subject: Hadoop User Group Meeting (Bay Area) 1/21

 

The next Bay Area Hadoop User Group meeting is scheduled for Wednesday,
January 21st at Yahoo! 2821 Mission College Blvd, Santa Clara, Building
1, Training Rooms 3 & 4 from 6:00-7:30 pm.

 

Agenda:

Hadoop 0.20 Overview - Sameer Paranjpye

Hadoop 1.0 discussion - Sanjay Radia.

 

Please send me any other suggestions for topics for future meetings.

 

Registration: http://upcoming.yahoo.com/event/1478614

 

Look forward to seeing you there!

Ajay

 

 



running hadoop on heterogeneous hardware

2009-01-21 Thread Bill Au
Is hadoop designed to run on homogeneous hardware only, or does it work just
as well on heterogeneous hardware?  If the datanodes have different
disk capacities, does HDFS still spread the data blocks equally among all
the datanodes, or will the datanodes with higher disk capacity end up storing
more data blocks?  Similarly, if the tasktrackers have different numbers of
CPUs, is there a way to configure hadoop to run more tasks on those
tasktrackers that have more CPUs?  Is that simply a matter of setting
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum differently on the tasktrackers?

Bill
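
(For the CPU part of the question: those two properties are read by each
tasktracker from its own local configuration, so they can be set differently
per node, e.g. in the hadoop-site.xml of a box with more cores; the values
below are examples only:)

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>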


Re: using distcp for http source files

2009-01-21 Thread Derek Young
Tsz Wo (Nicholas), Sze  writes:

> 
> Hi Derek,
> 
> The "http" in "http://core:7274/logs/log.20090121"; should be "hftp".  hftp is 
the scheme name of
> HftpFileSystem which uses http for accessing hdfs.
> 
> Hope this helps.
> 
> Nicholas Sze


I thought hftp is used to talk to servlets that act as a gateway to hdfs 
right?  In my case these will be servers that are serving up static log files, 
running no servlets.  I believe this is the scenario that HADOOP-341 describes: 
"Enhance it [distcp] to handle http as the source protocol i.e. support copying 
files from arbitrary http-based sources into the dfs."

In any case if I just use hftp instead of http I get this error:

bin/hadoop distcp -f hftp://core:7274/logs/log.20090121 /user/dyoung/mylogs

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Server returned HTTP response code: 400 for 
URL: http://core:7274/data/logs/log.20090121?
ugi=dyoung,dyoung,adm,dialout,fax,cdrom,cdrom,\
floppy,floppy,tape,audio,audio,dip,dip,video,video,\
plugdev,plugdev,admin,users,scanner,fuse,fuse,lpadmin,\
admin,vboxusers
at sun.net.www.protocol.http.HttpURLConnection.getInputStream
(HttpURLConnection.java:1241)
at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:124)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:581)
at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)

> 
> - Original Message 
> > From: Derek Young 
> > To: core-u...@...
> > Sent: Wednesday, January 21, 2009 1:23:56 PM
> > Subject: using distcp for http source files
> > 
> > I plan to use hadoop to do some log processing and I'm working on a method 
to 
> > load the files (probably nightly) into hdfs.  My plan is to have a web 
server on 
> > each machine with logs that serves up the log directories.  Then I would 
give 
> > distcp a list of http URLs of the log files and have it copy the files in.
> > 
> > Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this 
> > should be supported, but the http URLs are not working for me.  Are http 
source 
> > URLs still supported?
> > 
> > I tried a simple test with an http source URL (using Hadoop 0.19):
> > 
> > hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs
> > 
> > This fails:
> > 
> > With failures, global counters are inaccurate; consider running with -i
> > Copy failed: java.io.IOException: No FileSystem for scheme: http
> >at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
> >at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
> >at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
> >at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
> >at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> >at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
> >at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
> >at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
> >at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
> >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
> 
> 






RE: Problem running hdfs_test

2009-01-21 Thread Arifa Nisar
Hello,

As I mentioned in my previous email, I am getting a segmentation fault at
0x0001 while running hdfs_test. It was suggested that I build and run
hdfs_test using ant, as ant should set some environment variables which the
Makefile won't. I tried building libhdfs and running hdfs_test using ant, but
I am still having the same problem. Now, instead of hdfs_test, I am trying a
simpler test with libhdfs: I linked the following hello world program against
libhdfs.

#include "hdfs.h"
int main() {
  printf("Hello World.\n");
  return(0);
}
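
For reference, the link line for such a test (library paths assumed from the
variables listed further down) boils down to roughly:

gcc test.c -o test \
    -I${HADOOP_HOME}/src/c++/libhdfs \
    -L${LIBHDFS_BUILD_DIR} -lhdfs \
    -L${JAVA_HOME}/jre/lib/${OS_ARCH}/server -ljvm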

I added a rule to compile this test program in
${HADOOP_HOME}/src/c++/libhdfs/Makefile and replaced hdfs_test with this
test program in ${HADOOP_HOME}/src/c++/libhdfs/tests/test-libhdfs.sh. I built
and invoked this test using the test-libhdfs target in build.xml, but I am still
getting a segmentation fault when this simple test program is invoked from
test-libhdfs.sh. I followed these steps:

cd ${HADOOP_HOME}
ant clean
cd ${HADOOP_HOME}/src/c++/libhdfs/
rm -f hdfs_test hdfs_write hdfs_read libhdfs.so* *.o test
cd ${HADOOP_HOME}
ant test-libhdfs -Dlibhdfs=1

Error Line
--
[exec] ./tests/test-libhdfs.sh: line 85: 23019 Segmentation fault
$LIBHDFS_BUILD_DIR/$HDFS_TEST

I have attached the output of this command with this email. I have added
"env" in test-libhdfs.sh to see what environmental variable are set. Please
suggest if any variable is wrongly set. Any kind of suggestion will be
helpful for me as I have already spent a lot of time on this problem. 

I have added following lines in Makefile and test-libhdfs.sh

Makefile
-
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-icedtea-1.7.0.0.x86_64
export OS_ARCH=amd64
export OS_NAME=Linux
export LIBHDFS_BUILD_DIR=$(HADOOP_HOME)/src/c++/libhdfs
export SHLIB_VERSION=1

test-libhdfs.sh
--

HADOOP_CONF_DIR=${HADOOP_HOME}/conf
HADOOP_LOG_DIR=${HADOOP_HOME}/logs
LIBHDFS_BUILD_DIR=${HADOOP_HOME}/src/c++/libhdfs
HDFS_TEST=test

When I don't link libhdfs with test.c, it doesn't give an error and prints
"Hello World" when "ant test-libhdfs -Dlibhdfs=1" is run. I made sure that
"ant" and "hadoop" use the same java installation, and I have also tried this
on a 32-bit machine, but I am still getting the segmentation fault. Now I am
clueless about what I can do to correct this. Please help.

Thanks,
Arifa.

PS: Also, please suggest: is there any java version of hdfs_test?
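
(A rough Java equivalent of such a smoke test, for comparison, can be written
directly against the FileSystem API; the path below is only an example.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS round trip: write a file, read it back, delete it.
public class HdfsSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/hdfs_smoke_test.txt");  // example path
    FSDataOutputStream out = fs.create(p, true);
    out.writeUTF("Hello World.");
    out.close();
    FSDataInputStream in = fs.open(p);
    System.out.println(in.readUTF());
    in.close();
    fs.delete(p, false);
  }
}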

-Original Message-
From: Delip Rao [mailto:delip...@gmail.com] 
Sent: Saturday, January 17, 2009 3:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Problem running hdfs_test

Try enabling the debug flags while compiling to get more information.

On Sat, Jan 17, 2009 at 4:19 AM, Arifa Nisar 
wrote:
> Hello all,
>
>
>
> I am trying to test hdfs_test.c provided with hadoop installation.
> libhdfs.so and hdfs_test are built fine after making a few  changes in
> $(HADOOP_HOME)/src/c++/libhdfs/Makefile. But when I try to run
./hdfs_test,
> I get segmentation fault at 0x0001
>
>
>
> Program received signal SIGSEGV, Segmentation fault.
>
> 0x0001 in ?? ()
>
> (gdb) bt
>
> #0  0x0001 in ?? ()
>
> #1  0x7fffd0c51af5 in ?? ()
>
> #2  0x in ?? ()
>
>
>
> A simple hello world program linked with libhdfs.so also gives the same
> error. In CLASSPATH all the jar files in $(HADOOP_HOME),
> $(HADOOP_HOME)/conf, $(HADOOP_HOME)/lib,$(JAVA_HOME)/lib are included.
> Please help.
>
>
>
> Thanks,
>
> Arifa.
>
>
>
>
Buildfile: build.xml

init:
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/classes
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/tools
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/src
[mkdir] Created dir: 
/home/ani662/hadoop/hadoop-0.19.0/build/webapps/task/WEB-INF
[mkdir] Created dir: 
/home/ani662/hadoop/hadoop-0.19.0/build/webapps/job/WEB-INF
[mkdir] Created dir: 
/home/ani662/hadoop/hadoop-0.19.0/build/webapps/hdfs/WEB-INF
[mkdir] Created dir: 
/home/ani662/hadoop/hadoop-0.19.0/build/webapps/datanode/WEB-INF
[mkdir] Created dir: 
/home/ani662/hadoop/hadoop-0.19.0/build/webapps/secondary/WEB-INF
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/examples
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/ant
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/c++
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test/classes
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test/testjar
[mkdir] Created dir: /home/ani662/hadoop/hadoop-0.19.0/build/test/testshell
[touch] Creating /tmp/null333580654
   [delete] Deleting: /tmp/null333580654
 [copy] Copying 7 files to /home/ani662/hadoop/hadoop-0.19.0/build/webapps
 [exec] svn: '.' is not a working copy
 [exec] svn: '.' is not a working copy

compile-libhdfs:
[mkdir] 

Re: using distcp for http source files

2009-01-21 Thread Tsz Wo (Nicholas), Sze
Hi Derek,

The "http" in "http://core:7274/logs/log.20090121"; should be "hftp".  hftp is 
the scheme name of HftpFileSystem which uses http for accessing hdfs.

Hope this helps.

Nicholas Sze
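
To make the scheme concrete: hftp URLs point at the namenode's HTTP port
(dfs.http.address, 50070 by default), not at an arbitrary web server, so a copy
typically looks like this (hostnames and ports are examples):

bin/hadoop distcp hftp://namenode.example.com:50070/user/dyoung/logs \
                  hdfs://namenode.example.com:9000/user/dyoung/logs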




- Original Message 
> From: Derek Young 
> To: core-user@hadoop.apache.org
> Sent: Wednesday, January 21, 2009 1:23:56 PM
> Subject: using distcp for http source files
> 
> I plan to use hadoop to do some log processing and I'm working on a method to 
> load the files (probably nightly) into hdfs.  My plan is to have a web server 
> on 
> each machine with logs that serves up the log directories.  Then I would give 
> distcp a list of http URLs of the log files and have it copy the files in.
> 
> Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this 
> should be supported, but the http URLs are not working for me.  Are http 
> source 
> URLs still supported?
> 
> I tried a simple test with an http source URL (using Hadoop 0.19):
> 
> hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs
> 
> This fails:
> 
> With failures, global counters are inaccurate; consider running with -i
> Copy failed: java.io.IOException: No FileSystem for scheme: http
>at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
>at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
>at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
>at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
>at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)



using distcp for http source files

2009-01-21 Thread Derek Young
I plan to use hadoop to do some log processing and I'm working on a 
method to load the files (probably nightly) into hdfs.  My plan is to 
have a web server on each machine with logs that serves up the log 
directories.  Then I would give distcp a list of http URLs of the log 
files and have it copy the files in.


Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like 
this should be supported, but the http URLs are not working for me.  Are 
http source URLs still supported?


I tried a simple test with an http source URL (using Hadoop 0.19):

hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs

This fails:

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: No FileSystem for scheme: http
   at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)

   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
   at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
   at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
   at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)



ANN: hbase-0.19.0 release available for download

2009-01-21 Thread stack

HBase 0.19.0 is now available for download

 http://hadoop.apache.org/hbase/releases.html

Thanks to all who contributed to this release.  185 issues have been 
fixed since hbase 0.18.0.   Release notes are available here: 
http://tinyurl.com/8xmyx9


At your service,
The HBase Team



RE: Suitable for Hadoop?

2009-01-21 Thread Zak, Richard [USA]
Darren- I would definitely use HDFS to get the data to all the instances.
I'm not sure about your 32 processes or SQS, but let me/us know what you
find. 


Richard J. Zak

-Original Message-
From: Ricky Ho [mailto:r...@adobe.com] 
Sent: Wednesday, January 21, 2009 15:00
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?

Jim, thanks for your explanation.  But isn't isSplittable an option in
writing output rather than reading input ?

There are two phases.

1) Upload the data from local file to HDFS.  Is there an option in the
"hadoop fs copy" to pack multiple small files in a single block and also not
splitting the files.  So should "isSplittable" be applied here ?

2) The Mapper will start reading data using MultiFileInputFormat, which will
read data in units of the block size.  Therefore a single mapper will get all
the files within its assigned block.  Right?

Rgds,
Ricky

-Original Message-
From: Jim Twensky [mailto:jim.twen...@gmail.com]
Sent: Wednesday, January 21, 2009 11:47 AM
To: core-user@hadoop.apache.org
Subject: Re: Suitable for Hadoop?

Ricky,

Hadoop was formerly optimized for large files, usually files of size larger
than one input split. However, there is an input format called
MultiFileInputFormat which can be used to utilize Hadoop to work efficiently
on smaller files. You can also set the isSplittable method of an input
format to "false" and ensure that a file is not split into pieces but rather
processed by only one mapper.

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho  wrote:

> Hmmm ...
>
> From a space efficiency perspective, given HDFS (with large block 
> size) is expecting large files, is Hadoop optimized for processing 
> large number of small files ?  Does each file take up at least 1 block 
> ? or multiple files can sit on the same block.
>
> Rgds,
> Ricky
> -Original Message-
> From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to 
> concatenate them, and the New York times used Hadoop to process a few 
> TB of PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't 
> know what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, 
> since the mapper takes in file contents (and you don't want to read in 
> one line of
> binary)
> - Have the map process all contents for that one given directory from 
> the input text file
> - Break down the documents into more directories to go easier on the 
> memory
> - Use Amazon's EC2, and the scripts in 
> /src/contrib/ec2/bin/ (there is a script which passes 
> environment variables to launched instances, modify the script to 
> allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE 
> environment variable and having the variable properly
> passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it 
> still uses the HDFS in a beneficial manner, and the distributed part 
> is very helpful!
>
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
>  I have a task to process large quantities of files by converting them 
> into other formats. Each file is processed as a whole and converted to 
> a target format. Since there are 100's of GB of data I thought it 
> suitable for Hadoop, but the problem is, I don't think the files can 
> be broken apart and processed. For example, how would mapreduce work 
> to convert a Word Document to PDF if the file is reduced to blocks? 
> I'm not sure that's possible, or is it?
>
> thanks for any advice
> Darren
>
>




RE: Suitable for Hadoop?

2009-01-21 Thread Ricky Ho
Jim, thanks for your explanation.  But isn't isSplittable an option in writing 
output rather than reading input ?

There are two phases.

1) Upload the data from local file to HDFS.  Is there an option in the "hadoop 
fs copy" to pack multiple small files in a single block and also not splitting 
the files.  So should "isSplittable" be applied here ?

2) The Mapper will start reading data using MultiFileInputFormat, which will
read data in units of the block size.  Therefore a single mapper will get all
the files within its assigned block.  Right?

Rgds,
Ricky

-Original Message-
From: Jim Twensky [mailto:jim.twen...@gmail.com] 
Sent: Wednesday, January 21, 2009 11:47 AM
To: core-user@hadoop.apache.org
Subject: Re: Suitable for Hadoop?

Ricky,

Hadoop was formerly optimized for large files, usually files of size larger
than one input split. However, there is an input format called
MultiFileInputFormat which can be used to utilize Hadoop to work efficiently
on smaller files. You can also set the isSplittable method of an input
format to "false" and ensure that a file is not split into pieces but rather
processed by only one mapper.

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho  wrote:

> Hmmm ...
>
> From a space efficiency perspective, given HDFS (with large block size) is
> expecting large files, is Hadoop optimized for processing large number of
> small files ?  Does each file take up at least 1 block ? or multiple files
> can sit on the same block.
>
> Rgds,
> Ricky
> -Original Message-
> From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
> concatenate them, and the New York times used Hadoop to process a few TB of
> PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know
> what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since
> the mapper takes in file contents (and you don't want to read in one line
> of
> binary)
> - Have the map process all contents for that one given directory from the
> input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/
> (there is a script which passes environment variables to launched
> instances,
> modify the script to allow Hadoop to use more memory by setting the
> HADOOP_HEAPSIZE environment variable and having the variable properly
> passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
> uses the HDFS in a beneficial manner, and the distributed part is very
> helpful!
>
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
>  I have a task to process large quantities of files by converting them into
> other formats. Each file is processed as a whole and converted to a target
> format. Since there are 100's of GB of data I thought it suitable for
> Hadoop, but the problem is, I don't think the files can be broken apart and
> processed. For example, how would mapreduce work to convert a Word Document
> to PDF if the file is reduced to blocks? I'm not sure that's possible, or
> is
> it?
>
> thanks for any advice
> Darren
>
>


Re: Suitable for Hadoop?

2009-01-21 Thread Jim Twensky
Ricky,

Hadoop was formerly optimized for large files, usually files of size larger
than one input split. However, there is an input format called
MultiFileInputFormat which can be used to utilize Hadoop to work efficiently
on smaller files. You can also set the isSplittable method of an input
format to "false" and ensure that a file is not split into pieces but rather
processed by only one mapper.

Jim
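
A sketch of that last point against the old mapred API (the class name is just
an example; note the framework spells the method isSplitable):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// One file per mapper: mark every input file as non-splittable.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // each file becomes exactly one split, handled by one map task
  }
}

It would then be registered on the JobConf with
conf.setInputFormat(WholeFileTextInputFormat.class).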

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho  wrote:

> Hmmm ...
>
> From a space efficiency perspective, given HDFS (with large block size) is
> expecting large files, is Hadoop optimized for processing large number of
> small files ?  Does each file take up at least 1 block ? or multiple files
> can sit on the same block.
>
> Rgds,
> Ricky
> -Original Message-
> From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
> concatenate them, and the New York times used Hadoop to process a few TB of
> PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know
> what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since
> the mapper takes in file contents (and you don't want to read in one line
> of
> binary)
> - Have the map process all contents for that one given directory from the
> input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/
> (there is a script which passes environment variables to launched
> instances,
> modify the script to allow Hadoop to use more memory by setting the
> HADOOP_HEAPSIZE environment variable and having the variable properly
> passed)
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
> uses the HDFS in a beneficial manner, and the distributed part is very
> helpful!
>
>
> Richard J. Zak
>
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
>  I have a task to process large quantities of files by converting them into
> other formats. Each file is processed as a whole and converted to a target
> format. Since there are 100's of GB of data I thought it suitable for
> Hadoop, but the problem is, I don't think the files can be broken apart and
> processed. For example, how would mapreduce work to convert a Word Document
> to PDF if the file is reduced to blocks? I'm not sure that's possible, or
> is
> it?
>
> thanks for any advice
> Darren
>
>


RE: Suitable for Hadoop?

2009-01-21 Thread Darren Govoni
Richard,
   Thanks for the suggestion. I actually am building an EC2 architecture
to facilitate this! I tried using a database to warehouse the files, and
then NFS, but the connection load is too heavy. So I thought maybe HDFS
could be used just to mitigate the data access across all the
instances. I have a parallel processing architecture based on SQS
queues, but will consider a map process.
   I have about 32 processes or so per machine in EC2 reading from SQS
queues for files to process; they could then efficiently get the files
from HDFS, yes? Without bottlenecking access to a database or NFS
server?

Well, I will test this direction and see, too.

Thank you!
Darren

On Wed, 2009-01-21 at 09:41 -0500, Zak, Richard [USA] wrote:
> You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
> concatenate them, and the New York times used Hadoop to process a few TB of
> PDFs.
> 
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know
> what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since
> the mapper takes in file contents (and you don't want to read in one line of
> binary)
> - Have the map process all contents for that one given directory from the
> input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/
> (there is a script which passes environment variables to launched instances,
> modify the script to allow Hadoop to use more memory by setting the
> HADOOP_HEAPSIZE environment variable and having the variable properly
> passed)
> 
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
> uses the HDFS in a beneficial manner, and the distributed part is very
> helpful!
> 
> 
> Richard J. Zak
> 
> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com] 
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
> 
> Hi,
>   I have a task to process large quantities of files by converting them into
> other formats. Each file is processed as a whole and converted to a target
> format. Since there are 100's of GB of data I thought it suitable for
> Hadoop, but the problem is, I don't think the files can be broken apart and
> processed. For example, how would mapreduce work to convert a Word Document
> to PDF if the file is reduced to blocks? I'm not sure that's possible, or is
> it?
> 
> thanks for any advice
> Darren
> 



RE: Suitable for Hadoop?

2009-01-21 Thread Ricky Ho
Hmmm ...

From a space efficiency perspective, given HDFS (with its large block size) is
expecting large files, is Hadoop optimized for processing a large number of
small files?  Does each file take up at least 1 block, or can multiple files
sit on the same block?

Rgds,
Ricky
-Original Message-
From: Zak, Richard [USA] [mailto:zak_rich...@bah.com] 
Sent: Wednesday, January 21, 2009 6:42 AM
To: core-user@hadoop.apache.org
Subject: RE: Suitable for Hadoop?

You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
concatenate them, and the New York times used Hadoop to process a few TB of
PDFs.

What I would do is this:
- Use the iText library, a Java library for PDF manipulation (don't know
what you would use for reading Word docs)
- Don't use any Reducers
- Have the input be a text file with the directory(ies) to process, since
the mapper takes in file contents (and you don't want to read in one line of
binary)
- Have the map process all contents for that one given directory from the
input text file
- Break down the documents into more directories to go easier on the memory
- Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/
(there is a script which passes environment variables to launched instances,
modify the script to allow Hadoop to use more memory by setting the
HADOOP_HEAPSIZE environment variable and having the variable properly
passed)

I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
uses the HDFS in a beneficial manner, and the distributed part is very
helpful!


Richard J. Zak

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?

Hi,
  I have a task to process large quantities of files by converting them into
other formats. Each file is processed as a whole and converted to a target
format. Since there are 100's of GB of data I thought it suitable for
Hadoop, but the problem is, I don't think the files can be broken apart and
processed. For example, how would mapreduce work to convert a Word Document
to PDF if the file is reduced to blocks? I'm not sure that's possible, or is
it?

thanks for any advice
Darren



RE: Suitable for Hadoop?

2009-01-21 Thread Zak, Richard [USA]
You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
concatenate them, and the New York times used Hadoop to process a few TB of
PDFs.

What I would do is this:
- Use the iText library, a Java library for PDF manipulation (don't know
what you would use for reading Word docs)
- Don't use any Reducers
- Have the input be a text file with the directory(ies) to process, since
the mapper takes in file contents (and you don't want to read in one line of
binary)
- Have the map process all contents for that one given directory from the
input text file
- Break down the documents into more directories to go easier on the memory
- Use Amazon's EC2, and the scripts in /src/contrib/ec2/bin/
(there is a script which passes environment variables to launched instances,
modify the script to allow Hadoop to use more memory by setting the
HADOOP_HEAPSIZE environment variable and having the variable properly
passed)

I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
uses the HDFS in a beneficial manner, and the distributed part is very
helpful!


Richard J. Zak
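
A skeleton of that mapper, against the old mapred API (class and method names
here are illustrative, and convert() is only a placeholder for the real
iText/Word handling):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input line names one directory; the map converts every file in it.
public class ConvertDirMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    Path dir = new Path(value.toString().trim());   // one directory per input line
    FileSystem fs = dir.getFileSystem(conf);
    FileStatus[] files = fs.listStatus(dir);
    if (files == null) {
      return;                                       // missing directory: nothing to do
    }
    for (FileStatus stat : files) {
      if (stat.isDir()) {
        continue;
      }
      convert(fs, stat.getPath());                  // placeholder for the conversion
      output.collect(new Text(stat.getPath().toString()), NullWritable.get());
      reporter.progress();                          // keep the task from timing out
    }
  }

  private void convert(FileSystem fs, Path src) throws IOException {
    // read with fs.open(src), transform, write the result back with fs.create(...)
  }
}

The job would be submitted with conf.setNumReduceTasks(0), matching the
"no Reducers" point above.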

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: Wednesday, January 21, 2009 08:08
To: core-user@hadoop.apache.org
Subject: Suitable for Hadoop?

Hi,
  I have a task to process large quantities of files by converting them into
other formats. Each file is processed as a whole and converted to a target
format. Since there are 100's of GB of data I thought it suitable for
Hadoop, but the problem is, I don't think the files can be broken apart and
processed. For example, how would mapreduce work to convert a Word Document
to PDF if the file is reduced to blocks? I'm not sure that's possible, or is
it?

thanks for any advice
Darren





Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Tom White
Hi Matthias,

It is not necessary to have SSH set up to run Hadoop, but it does make
things easier. SSH is used by the scripts in the bin directory which
start and stop daemons across the cluster (the slave nodes are defined
in the slaves file), see the start-all.sh script as a starting point.
These scripts are a convenient way to control Hadoop, but there are
other possibilities. If you had another system to control daemons on
your cluster then you wouldn't need SSH.

Tom
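
Concretely, the same daemons those scripts start over SSH can be started
locally on each machine by whatever mechanism you prefer, for example:

# on the master
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start jobtracker

# on each slave (by hand, init scripts, or any other management tool)
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker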

On Wed, Jan 21, 2009 at 1:20 PM, Matthias Scherer
 wrote:
> Hi Steve and Amit,
>
> Thanks for your answers. I agree with you that key-based ssh is nothing to
> worry about. But I'm wondering what exactly - that is, which grid
> administration tasks - hadoop does via ssh?! Does it restart crashed data
> nodes or task trackers on the slaves? Or does it transfer data over the
> grid with ssh access? How can I find a short description of what exactly hadoop
> needs ssh for? The documentation says only that I have to configure it.
>
> Thanks & Regards
> Matthias
>
>
>> -Original Message-
>> From: Steve Loughran [mailto:ste...@apache.org]
>> Sent: Wednesday, January 21, 2009 13:59
>> To: core-user@hadoop.apache.org
>> Subject: Re: Why does Hadoop need ssh access to master and slaves?
>>
>> Amit k. Saha wrote:
>> > On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
>> >  wrote:
>> >> Hi all,
>> >>
>> >> we've made our first steps in evaluating hadoop. The setup
>> of 2 VMs
>> >> as a hadoop grid was very easy and works fine.
>> >>
>> >> Now our operations team wonders why hadoop has to be able
>> to connect
>> >> to the master and slaves via password-less ssh?! Can
>> anyone give us
>> >> an answer to this question?
>> >
>> > 1. There has to be a way to connect to the remote hosts-
>> slaves and a
>> > secondary master, and SSH is the secure way to do it 2. It
>> has to be
>> > password-less to enable automatic logins
>> >
>>
>> SSH is *a * secure way to do it, but not the only way. Other
>> management tools can bring up hadoop clusters. Hadoop ships
>> with scripted support for SSH as it is standard with Linux
>> distros and generally the best way to bring up a remote console.
>>
>> Matthias,
>> Your ops team should not be worrying about the SSH security,
>> as long as they keep their keys under control.
>>
>> (a) Key-based SSH is more secure than passworded SSH, as
>> man-in-middle attacks are prevented. passphrase protected SSH
>> keys on external USB keys even better.
>>
>> (b) once the cluster is up, that filesystem is pretty
>> vulnerable to anything on the LAN. You do need to lock down
>> your datacentre, or set up the firewall/routing of the
>> servers so that only trusted hosts can talk to the FS. SSH
>> becomes a detail at that point.
>>
>>
>>
>


Re: AW: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Steve Loughran

Matthias Scherer wrote:

Hi Steve and Amit,

Thanks for your answers. I agree with you that key-based ssh is nothing to
worry about. But I'm wondering what exactly - that is, which grid
administration tasks - hadoop does via ssh?! Does it restart crashed data nodes
or task trackers on the slaves? Or does it transfer data over the grid with
ssh access? How can I find a short description of what exactly hadoop needs ssh
for? The documentation says only that I have to configure it.

Thanks & Regards
Matthias



SSH is used by the various scripts in bin/ to start and stop clusters, 
slaves.sh does the work, the other ones (like hadoop-daemons.sh) use it 
to run stuff on the machines.


The EC2 scripts use SSH to talk to the machines brought up there; when 
you ask amazon for machines, you give it a public key to be set to the 
allowed keys list of root; you use that to ssh in and run code.


There is currently no liveness/restarting built into the scripts; you 
need other things to do that. I am working on this, with  HADOOP-3628, 
https://issues.apache.org/jira/browse/HADOOP-3628


I will be showing some other management options at ApacheCon EU 2009, 
which being on the same continent and timezone is something you may want 
to consider attending; lots of Hadoop people will be there, with some 
all-day sessions on it.

http://eu.apachecon.com/c/aceu2009/sessions/227

One big problem with cluster management is not just recognising failed 
nodes, it's handling them. The actions you take are different with a 
VM-cluster like EC2 (fix: reboot, then kill that AMI and create a new 
one), from that of a VM-ware/Xen-managed cluster, to that of physical 
systems (Y!: phone Allen, us: email paolo). Once we have the health 
monitoring in there different people will need to apply their own policies.


-steve

--
Steve Loughran  http://www.1060.org/blogxter/publish/5



AW: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Matthias Scherer
Hi Steve and Amit,

Thanks for your answers. I agree with you that key-based ssh is nothing to
worry about. But I'm wondering what exactly - that is, which grid
administration tasks - hadoop does via ssh?! Does it restart crashed data nodes
or task trackers on the slaves? Or does it transfer data over the grid with
ssh access? How can I find a short description of what exactly hadoop needs ssh
for? The documentation says only that I have to configure it.

Thanks & Regards
Matthias


> -Original Message-
> From: Steve Loughran [mailto:ste...@apache.org]
> Sent: Wednesday, January 21, 2009 13:59
> To: core-user@hadoop.apache.org
> Subject: Re: Why does Hadoop need ssh access to master and slaves?
> 
> Amit k. Saha wrote:
> > On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer 
> >  wrote:
> >> Hi all,
> >>
> >> we've made our first steps in evaluating hadoop. The setup 
> of 2 VMs 
> >> as a hadoop grid was very easy and works fine.
> >>
> >> Now our operations team wonders why hadoop has to be able 
> to connect 
> >> to the master and slaves via password-less ssh?! Can 
> anyone give us 
> >> an answer to this question?
> > 
> > 1. There has to be a way to connect to the remote hosts- 
> slaves and a 
> > secondary master, and SSH is the secure way to do it 2. It 
> has to be 
> > password-less to enable automatic logins
> > 
> 
> SSH is *a * secure way to do it, but not the only way. Other 
> management tools can bring up hadoop clusters. Hadoop ships 
> with scripted support for SSH as it is standard with Linux 
> distros and generally the best way to bring up a remote console.
> 
> Matthias,
> Your ops team should not be worrying about the SSH security, 
> as long as they keep their keys under control.
> 
> (a) Key-based SSH is more secure than passworded SSH, as 
> man-in-middle attacks are prevented. passphrase protected SSH 
> keys on external USB keys even better.
> 
> (b) once the cluster is up, that filesystem is pretty 
> vulnerable to anything on the LAN. You do need to lock down 
> your datacentre, or set up the firewall/routing of the 
> servers so that only trusted hosts can talk to the FS. SSH 
> becomes a detail at that point.
> 
> 
> 


Suitable for Hadoop?

2009-01-21 Thread Darren Govoni
Hi,
  I have a task to process large quantities of files by converting them
into other formats. Each file is processed as a whole and converted to a
target format. Since there are 100's of GB of data I thought it suitable
for Hadoop, but the problem is, I don't think the files can be broken
apart and processed. For example, how would mapreduce work to convert a
Word Document to PDF if the file is reduced to blocks? I'm not sure
that's possible, or is it?

thanks for any advice
Darren



Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Steve Loughran

Amit k. Saha wrote:

On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
 wrote:

Hi all,

we've made our first steps in evaluating hadoop. The setup of 2 VMs as a
hadoop grid was very easy and works fine.

Now our operations team wonders why hadoop has to be able to connect to
the master and slaves via password-less ssh?! Can anyone give us an
answer to this question?


1. There has to be a way to connect to the remote hosts- slaves and a
secondary master, and SSH is the secure way to do it
2. It has to be password-less to enable automatic logins



SSH is *a * secure way to do it, but not the only way. Other management 
tools can bring up hadoop clusters. Hadoop ships with scripted support 
for SSH as it is standard with Linux distros and generally the best way 
to bring up a remote console.


Matthias,
Your ops team should not be worrying about the SSH security, as long as 
they keep their keys under control.


(a) Key-based SSH is more secure than passworded SSH, as man-in-middle 
attacks are prevented. passphrase protected SSH keys on external USB 
keys even better.


(b) once the cluster is up, that filesystem is pretty vulnerable to 
anything on the LAN. You do need to lock down your datacentre, or set up 
the firewall/routing of the servers so that only trusted hosts can talk 
to the FS. SSH becomes a detail at that point.





Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Amit k. Saha
On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
 wrote:
> Hi all,
>
> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a
> hadoop grid was very easy and works fine.
>
> Now our operations team wonders why hadoop has to be able to connect to
> the master and slaves via password-less ssh?! Can anyone give us an
> answer to this question?

1. There has to be a way to connect to the remote hosts- slaves and a
secondary master, and SSH is the secure way to do it
2. It has to be password-less to enable automatic logins

-Amit


>
> Thanks & Regards
> Matthias
>



-- 
Amit Kumar Saha
http://amitksaha.blogspot.com
http://amitsaha.in.googlepages.com/
*Bangalore Open Java Users Group*:http:www.bojug.in


Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Matthias Scherer
Hi all,

we've made our first steps in evaluating hadoop. The setup of 2 VMs as a
hadoop grid was very easy and works fine.

Now our operations team wonders why hadoop has to be able to connect to
the master and slaves via password-less ssh?! Can anyone give us an
answer to this question?

Thanks & Regards
Matthias


Re: Null Pointer with Pattern file

2009-01-21 Thread Rasit OZDAS
Hi,
Try to use:

conf.setJarByClass(EchoOche.class);  // conf is the JobConf instance of your
example.

Hope this helps,
Rasit
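
If setting the jar does not cure it, the other usual suspect in the version-2
example is configure() assuming the cached patterns file is always present.
A defensive shape for it (a sketch of the tutorial's structure with a null
check added, not the tutorial code verbatim) is:

// inside the Map class of the WordCount v2 example (sketch)
public void configure(JobConf job) {
  caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
  if (job.getBoolean("wordcount.skip.patterns", false)) {
    Path[] patternsFiles = new Path[0];
    try {
      patternsFiles = DistributedCache.getLocalCacheFiles(job);
    } catch (IOException ioe) {
      System.err.println("Caught exception while getting cached files: " + ioe);
    }
    if (patternsFiles == null) {
      // a null here is one way to end up with the NullPointerException at line 39
      System.err.println("No cached patterns file found; skipping pattern filtering");
      return;
    }
    for (Path patternsFile : patternsFiles) {
      parseSkipFile(patternsFile);
    }
  }
}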

2009/1/20 Shyam Sarkar 

> Hi,
>
> I was trying to run Hadoop wordcount version 2 example under Cygwin. I
> tried
> without pattern.txt file -- It works fine.
> I tried with pattern.txt file to skip some patterns, I get NULL POINTER
> exception as follows::
>
> 09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 09/01/20 12:56:17 WARN mapred.JobClient: No job jar file set.  User classes
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process
> : 4
> 09/01/20 12:56:17 INFO mapred.JobClient: Running job: job_local_0001
> 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process
> : 4
> 09/01/20 12:56:17 INFO mapred.MapTask: numReduceTasks: 1
> 09/01/20 12:56:17 INFO mapred.MapTask: io.sort.mb = 100
> 09/01/20 12:56:17 INFO mapred.MapTask: data buffer = 79691776/99614720
> 09/01/20 12:56:17 INFO mapred.MapTask: record buffer = 262144/327680
> 09/01/20 12:56:17 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.NullPointerException
>  at org.myorg.WordCount$Map.configure(WordCount.java:39)
>  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>  at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>  at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>  at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
>  at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> java.io.IOException: Job failed!
>  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>  at org.myorg.WordCount.run(WordCount.java:114)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at org.myorg.WordCount.main(WordCount.java:119)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>  at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>  at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
>
> Please tell me what should I do.
>
> Thanks,
> shyam.s.sar...@gmail.com
>



-- 
M. Raşit ÖZDAŞ


Re: streaming split sizes

2009-01-21 Thread tim robertson
Hi Dmitry,


What version of hadoop are you using?

Assuming your 3G DB is a read only lookup... can you load it into
memory in the Map.configure and then use (0.19+ only...):


<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>


So that the Maps are reused for all time and the in memory DB is not reloaded?

This is what I did for cross referencing Point data (the input) with
in-memory index of Polygons (the lookup DB) to find which polygons the
points fell in 
(http://biodivertido.blogspot.com/2008/11/reproducing-spatial-joins-using-hadoop.html).

Cheers,
Tim
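
On the original question of handing each map a bigger chunk: for
FileInputFormat-based jobs the usual knob is the minimum split size, which on
this vintage of streaming can typically be passed as a -jobconf option (the
1 GB value below is only an example):

-jobconf mapred.min.split.size=1073741824

Splits larger than a block trade away some data locality, so it is worth
measuring rather than assuming it helps.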



On Wed, Jan 21, 2009 at 4:07 AM, Dmitry Pushkarev  wrote:
> Well, the database is specifically designed to fit into memory, and if it does
> not, it will slow things down hundreds of times. One simple hack I came up with
> is to replace the map tasks with /bin/cat and then run 150 reducers that keep
> the database constantly in memory. Parallelism is also not a problem, since
> we're running a very small cluster (15 nodes, 120 cores) built specifically for
> the task.
>
> ---
> Dmitry Pushkarev
> +1-650-644-8988
>
> -Original Message-
> From: Delip Rao [mailto:delip...@gmail.com]
> Sent: Tuesday, January 20, 2009 6:19 PM
> To: core-user@hadoop.apache.org
> Subject: Re: streaming split sizes
>
> Hi Dmitry,
>
> Not a direct answer to your question but I think the right approach
> would be to not load your database into memory during config() but
> instead lookup the database from map() via Hbase or something similar.
> That way you don't have to worry about the split sizes. In fact using
> fewer splits would limit the parallelism you can achieve, given that
> your maps are so fast.
>
> - delip
>
> On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev  wrote:
>> Hi.
>>
>>
>>
>> I'm running streaming on relatively big (2Tb) dataset, which is  being
> split
>> by hadoop in 64mb pieces.  One of the problems I have with that is my map
>> tasks take very long time to initialize (they need to load 3GB database
> into
>> RAM) and they are finishing these 64mb in 10 seconds.
>>
>>
>>
>> So I'm wondering if there is any way to make hadoop give larger datasets
> to
>> map jobs? (trivial way, of course would be to split dataset to N files and
>> make it feed one file at a time, but is there any standard solution for
>> that?)
>>
>>
>>
>> Thanks.
>>
>>
>>
>> ---
>>
>> Dmitry Pushkarev
>>
>> +1-650-644-8988
>>
>>
>>
>>
>
>