Re: streaming split sizes

2009-01-21 Thread tim robertson
Hi Dmitry, What version of Hadoop are you using? Assuming your 3G DB is a read-only lookup... can you load it into memory in the Map.configure and then use (0.19+ only...): <property> <name>mapred.job.reuse.jvm.num.tasks</name> <value>-1</value> </property> so that the Maps are reused for all time
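
A minimal sketch of that pattern with the old 0.19 mapred API (the class name and the actual loading of the 3G data are illustrative, not from the thread):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // static, so the table survives across map tasks when the JVM is reused
  // (mapred.job.reuse.jvm.num.tasks = -1)
  private static Map<String, String> lookup;

  @Override
  public void configure(JobConf job) {
    synchronized (LookupMapper.class) {
      if (lookup == null) {
        lookup = new HashMap<String, String>();
        // ... load the read-only lookup data here, e.g. from files placed
        // in the DistributedCache ...
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String hit = lookup.get(value.toString());
    if (hit != null) {
      output.collect(value, new Text(hit));
    }
  }
}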

Re: Null Pointer with Pattern file

2009-01-21 Thread Rasit OZDAS
Hi, Try to use: conf.setJarByClass(EchoOche.class); // conf is the JobConf instance of your example. Hope this helps, Rasit 2009/1/20 Shyam Sarkar shyam.s.sar...@gmail.com Hi, I was trying to run Hadoop wordcount version 2 example under Cygwin. I tried without pattern.txt file -- It
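
For reference, a minimal sketch of where that call fits in driver setup (class name and I/O details are illustrative; old mapred API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // The suggestion from the thread: lets Hadoop find the jar containing
    // your mapper/reducer classes when the job is submitted to the cluster.
    conf.setJarByClass(WordCountDriver.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // conf.setMapperClass(...); conf.setReducerClass(...); elided here

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}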

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Steve Loughran
Amit k. Saha wrote: On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer matthias.sche...@1und1.de wrote: Hi all, we've taken our first steps in evaluating Hadoop. The setup of 2 VMs as a Hadoop grid was very easy and works fine. Now our operations team wonders why Hadoop has to be able to

Suitable for Hadoop?

2009-01-21 Thread Darren Govoni
Hi, I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart

AW: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Matthias Scherer
Hi Steve and Amit, Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly - that is, which grid administration tasks - Hadoop does via ssh. Does it restart crashed data nodes or task trackers on the slaves? Or does it

Re: AW: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Steve Loughran
Matthias Scherer wrote: Hi Steve and Amit, Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly - that is, which grid administration tasks - Hadoop does via ssh. Does it restart crashed data nodes or task trackers on the

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Tom White
Hi Matthias, It is not necessary to have SSH set up to run Hadoop, but it does make things easier. SSH is used by the scripts in the bin directory which start and stop daemons across the cluster (the slave nodes are defined in the slaves file), see the start-all.sh script as a starting point.
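
As a sketch, the same daemons can also be started by hand on each machine without SSH, using the per-node script instead of start-all.sh (paths relative to the Hadoop install):

# on the master
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start jobtracker

# on each slave
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker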

RE: Suitable for Hadoop?

2009-01-21 Thread Zak, Richard [USA]
You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and The New York Times used Hadoop to process a few TB of PDFs. What I would do is this: - Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs) - Don't

RE: Suitable for Hadoop?

2009-01-21 Thread Ricky Ho
Hmmm ... From a space efficiency perspective, given that HDFS (with its large block size) expects large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least 1 block, or can multiple files sit on the same block? Rgds, Ricky -Original

RE: Suitable for Hadoop?

2009-01-21 Thread Darren Govoni
Richard, Thanks for the suggestion. I actually am building an EC2 architecture to facilitate this! I tried using a database to warehouse the files, and then NFS, but the connection load is too heavy. So I thought maybe HDFS could be used just to mitigate the data access across all the

Re: Suitable for Hadoop?

2009-01-21 Thread Jim Twensky
Ricky, Hadoop is primarily optimized for large files, usually files larger than one input split. However, there is an input format called MultiFileInputFormat which can be used to make Hadoop work efficiently on smaller files. You can also override the isSplitable method of an input
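
A minimal sketch of the non-splitting approach (illustrative, old 0.19 mapred API): subclass an input format and make isSplitable return false, so each map task receives a whole file.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittingTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never split: one file per map task
  }
}

The job would then use it via conf.setInputFormat(NonSplittingTextInputFormat.class).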

RE: Suitable for Hadoop?

2009-01-21 Thread Ricky Ho
Jim, thanks for your explanation. But isn't isSplitable an option for writing output rather than reading input? There are two phases. 1) Upload the data from local files to HDFS. Is there an option in the hadoop fs copy to pack multiple small files into a single block and also not splitting
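
One common workaround (not from this thread; class names and paths are illustrative) is to pack the small local files into a single SequenceFile on HDFS, keyed by file name, so the data lands in a few large files rather than many small ones:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0] = local directory of small files, args[1] = HDFS output file
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        if (f.isFile()) {
          writer.append(new Text(f.getName()), new BytesWritable(readBytes(f)));
        }
      }
    } finally {
      writer.close();
    }
  }

  private static byte[] readBytes(File f) throws IOException {
    byte[] buf = new byte[(int) f.length()];
    DataInputStream in = new DataInputStream(new FileInputStream(f));
    try {
      in.readFully(buf);
    } finally {
      in.close();
    }
    return buf;
  }
}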

RE: Suitable for Hadoop?

2009-01-21 Thread Zak, Richard [USA]
Darren- I would definitely use HDFS to get the data to all the instances. I'm not sure about your 32 processes or SQS, but let me/us know what you find. Richard J. Zak -Original Message- From: Ricky Ho [mailto:r...@adobe.com] Sent: Wednesday, January 21, 2009 15:00 To:

ANN: hbase-0.19.0 release available for download

2009-01-21 Thread stack
HBase 0.19.0 is now available for download http://hadoop.apache.org/hbase/releases.html Thanks to all who contributed to this release. 185 issues have been fixed since hbase 0.18.0. Release notes are available here: http://tinyurl.com/8xmyx9 At your service, The HBase Team

using distcp for http source files

2009-01-21 Thread Derek Young
:7274/logs/log.20090121 /user/dyoung/mylogs This fails: With failures, global counters are inaccurate; consider running with -i Copy failed: java.io.IOException: No FileSystem for scheme: http at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364

Re: using distcp for http source files

2009-01-21 Thread Tsz Wo (Nicholas), Sze
Hi Derek, The "http" in "http://core:7274/logs/log.20090121" should be "hftp". hftp is the scheme name of HftpFileSystem, which uses http for accessing HDFS. Hope this helps. Nicholas Sze - Original Message From: Derek Young dyo...@kayak.com To: core-user@hadoop.apache.org Sent
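
For example, a distcp sourced from another HDFS cluster over HTTP would look roughly like this (host names and ports are illustrative; hftp typically goes through the namenode's HTTP port, 50070 by default):

bin/hadoop distcp hftp://srcnamenode:50070/logs/log.20090121 hdfs://dstnamenode:9000/user/dyoung/mylogs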

Re: using distcp for http source files

2009-01-21 Thread Derek Young
Tsz Wo (Nicholas), Sze s29752-hadoopu...@... writes: Hi Derek, The "http" in "http://core:7274/logs/log.20090121" should be "hftp". hftp is the scheme name of HftpFileSystem, which uses http for accessing HDFS. Hope this helps. Nicholas Sze I thought hftp was used to talk to servlets

running hadoop on heterogeneous hardware

2009-01-21 Thread Bill Au
Is Hadoop designed to run on homogeneous hardware only, or does it work just as well on heterogeneous hardware? If the datanodes have different disk capacities, does HDFS still spread the data blocks equally among all the datanodes, or will the datanodes with higher disk capacity end up

RE: Hadoop User Group Meeting (Bay Area) 1/21

2009-01-21 Thread Ajay Anand
Reminder - the Bay Area Hadoop User Group meeting is today at 6 pm. From: Ajay Anand Sent: Thursday, January 08, 2009 12:10 PM To: 'core-user@hadoop.apache.org'; 'gene...@hadoop.apache.org'; 'zookeeper-u...@hadoop.apache.org'; 'hbase-u...@hadoop.apache.org';

Re: using distcp for http source files

2009-01-21 Thread Doug Cutting
Derek Young wrote: Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this should be supported, but the http URLs are not working for me. Are http source URLs still supported? No. They used to be supported, but when distcp was converted to accept any Path this stopped

Decommissioning Nodes

2009-01-21 Thread Hargraves, Alyssa
Hello Hadoop Users, I was hoping someone would be able to answer a question about node decommissioning. I have a test Hadoop cluster set up which only consists of my computer and a master node. I am looking at the removal and addition of nodes. Adding a node is nearly instant (only about 5

Re: Decommissioning Nodes

2009-01-21 Thread Jeremy Chow
Hey Alyssa, If one of those datanodes goes down, a few minutes will pass before the master discovers it. The master node treats nodes that have not sent a heartbeat for quite a while as dead. On Thu, Jan 22, 2009 at 8:34 AM, Hargraves, Alyssa aly...@wpi.edu wrote: Hello Hadoop Users, I
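
For a graceful removal, as opposed to killing the datanode and waiting for the heartbeat timeout, the usual 0.19-era approach is an exclude file plus a refresh; a minimal sketch, assuming the path below (in hadoop-site.xml):

<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/excludes</value>
</property>

List the datanode's hostname in /path/to/excludes and run:

bin/hadoop dfsadmin -refreshNodes

The node is then reported as decommissioning until its blocks have been re-replicated elsewhere, which is why removing a node takes much longer than adding one.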