Handling of small files in hadoop

2011-09-14 Thread Naveen Mahale
Hi all, I use the hadoop-0.21.0 distribution. I have a large number of small files (a few KB each). Is there any efficient way of handling them in hadoop? I have heard that solutions for this problem include: 1. HAR (Hadoop archives) 2. cat on files. I would like to know if there are any
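As a sketch of the HAR approach mentioned above: the `hadoop archive` tool packs many small files into a single archive that MapReduce can still read through the `har://` scheme. The HDFS paths and archive name below are hypothetical examples, not from the thread.

```shell
# Pack the small files under /user/naveen/input into one HAR
# (-p gives the parent path that the source is relative to;
# all paths here are made-up examples)
hadoop archive -archiveName small-files.har -p /user/naveen /input /user/naveen/archives

# List the archived files back through the har:// filesystem scheme
hadoop fs -ls har:///user/naveen/archives/small-files.har
```

The archive itself is a MapReduce job, so it needs a running cluster; the resulting `.har` stores files contiguously, which reduces NameNode memory pressure compared with millions of tiny HDFS files.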

Setting permissions on startup, during safe mode

2011-09-14 Thread Ossi
hi, every time after starting our hadoop cluster (using Cloudera's distribution) this message appears: 2011-09-13 04:35:05,207 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode extension entered. The reported blocks 8995 has reached the threshold 0.9990 of total blocks 9005. Safe mode will be turned
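For context on the log message above: the NameNode stays in safe mode until enough DataNodes have reported their blocks to pass the configured threshold, and during that window write operations (including permission changes) are rejected. A minimal sketch of the 0.20-era commands for inspecting and waiting out safe mode:

```shell
# Report whether the NameNode is currently in safe mode
hadoop dfsadmin -safemode get

# Block until safe mode exits on its own (useful in startup scripts
# that need to run write operations right after cluster start)
hadoop dfsadmin -safemode wait

# Force the NameNode out of safe mode -- only do this manually if you
# understand why the block threshold has not been reached
hadoop dfsadmin -safemode leave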

Re: Handling of small files in hadoop

2011-09-14 Thread Joey Echeverria
Hi Naveen, I use hadoop-0.21.0 distribution. I have a large number of small files (KB). Word of warning, 0.21 is not a stable release. The recommended version is in the 0.20.x range. Is there any efficient way of handling it in hadoop? I have heard that solution for that problem is using:  

Re: Hadoop Streaming job Fails - Permission Denied error

2011-09-14 Thread Brock Noland
Hi, This probably belongs on mapreduce-user as opposed to common-user. I have BCC'ed the common-user group. Generally it's a best practice to ship the scripts with the job. Like so: hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar -input
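To sketch the "ship the scripts with the job" practice Brock describes: the streaming jar's `-file` option copies a local script into the job's working directory on every task node, so the `-mapper` and `-reducer` commands can refer to it by bare name. The script names and HDFS paths below are examples, not from the thread; the jar path is the CDH3 one quoted above.

```shell
# -file ships each script with the job so task nodes need no
# pre-installed copy (mapper.py/reducer.py and the HDFS paths
# are hypothetical examples)
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar \
  -input /user/me/input \
  -output /user/me/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py
```

Shipping the scripts this way also avoids the permission-denied class of failure, since the job runs its own copies rather than depending on a path and mode that may differ across nodes.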

Re: Running example application with capacity scheduler ?

2011-09-14 Thread Thomas Graves
I believe it defaults to submit a job to the default queue if you don't specify it. You don't have the default queue defined in your list of mapred.queue.names. So add -Dmapred.job.queue.name=myqueue1 (or another queue you have defined) to the wordcount command like: bin/hadoop jar
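Assembling Thomas's suggestion into a full command line, a sketch might look like the following. The queue name `myqueue1` comes from the thread; the examples-jar name and the input/output paths are assumptions.

```shell
# Submit the wordcount example to an explicitly named capacity-scheduler
# queue; the -D generic option must appear before the positional
# input/output arguments (jar name and paths are examples)
bin/hadoop jar hadoop-examples.jar wordcount \
  -Dmapred.job.queue.name=myqueue1 \
  /user/me/input /user/me/output
```

This works because the example programs parse `-D` options through the generic options machinery, so the queue property reaches the job configuration before submission.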

Am i crazy? - question about hadoop streaming

2011-09-14 Thread Mark Kerzner
Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job,

Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Konstantin Boudnik
I am sure if you ask on the provider-specific list you'll get a better answer than from the common Hadoop list ;) Cos On Wed, Sep 14, 2011 at 09:48PM, Mark Kerzner wrote: Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is

Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Mark Kerzner
I am sorry, you are right. mark On Wed, Sep 14, 2011 at 9:52 PM, Konstantin Boudnik c...@apache.org wrote: I am sure if you ask at provider's specific list you'll get a better answer than from common Hadoop list ;) Cos On Wed, Sep 14, 2011 at 09:48PM, Mark Kerzner wrote: Hi, I am

Re: Handling of small files in hadoop

2011-09-14 Thread Naveen Mahale
Hey, thanks Joey for that information. I will work on what you said. Regards Naveen Mahale On Wed, Sep 14, 2011 at 5:32 PM, Joey Echeverria j...@cloudera.com wrote: Hi Naveen, I use hadoop-0.21.0 distribution. I have a large number of small files (KB). Word of warning, 0.21 is not a

Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Prashant
On 09/15/2011 08:18 AM, Mark Kerzner wrote: Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.Reducer; So I am using mapreduce, not mapred, and everything works

Re: Am i crazy? - question about hadoop streaming

2011-09-14 Thread Mark Kerzner
Thank you, Prashant, it seems so. I already verified this by refactoring the code to use 0.20 API as well as 0.21 API in two different packages, and streaming happily works with 0.20. Mark On Wed, Sep 14, 2011 at 11:46 PM, Prashant prashan...@imaginea.com wrote: On 09/15/2011 08:18 AM, Mark