Distributed Clusters

2010-04-07 Thread James Seigel
I am new to this group, and relatively new to hadoop. 

I am looking at building a large cluster.  I was wondering if anyone has any 
best practices for a cluster in the hundreds of nodes?  As well, has anyone had 
experience with a cluster spanning multiple data centers.  Is this a bad 
practice? moderately bad practice?  insane?

Is it better to build the 1000 node cluster in a single data center?  Do you 
back one of these things up to a second data center or a different 1000 node 
cluster?

Sorry, I am asking crazy questions...I am just wanting to learn the meta issues 
and opportunities with making clusters.

Thanks for your ideas!

Cheers
James.




Re: Distributed Clusters

2010-04-08 Thread James Seigel
Thanks for the insights into this stuff so far.  I think we are doing 
some things right with automating everything and such.  An additional question I 
have: I have heard rhetoric about ZooKeeper being able to help with 
configuration of hadoop.  Is anyone using ZooKeeper in a 
way that helps with their deployment of the hadoop cluster?

Cheers
James.


On 2010-04-08, at 4:18 AM, Steve Loughran wrote:

> James Seigel wrote:
>> I am new to this group, and relatively new to hadoop. I am looking at 
>> building a large cluster.  I was wondering if anyone has any best practices 
>> for a cluster in the hundreds of nodes?  As well, has anyone had experience 
>> with a cluster spanning multiple data centers.  Is this a bad practice? 
>> moderately bad practice?  insane?
> 
> got some stuff here
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
> 
> though my clusters are of short life span and smaller. At that kind of scale 
> you need to know how to manage datacenters yourself or talk to people who do 
> (I deny all knowledge, though I will note that in HP consulting and EDS we do 
> have people who can handle this)
> 
>> Is it better to build the 1000 node cluster in a single data center?  
> 
> yes.
> 
>> Do you back one of these things up to a second data center or a different 
>> 1000 node cluster?
> 
> depends on your concerns and where the building is.
> 
> -If your facility is in the Bay Area then you want a separate datacentre on a 
> different fault line. If it's in Eastern WA or OR then you worry more about 
> volcanic activity and spec the roof to take 1-2m of volcanic ash. Power comes 
> off the big dams, which again may go down if there's an earthquake, but is 
> otherwise pretty reliable.
> 
> -if your worry is about continuous availability, you need different sites 
> with different (multiple) power suppliers and multiple data feeds, and more 
> to worry about in terms of keeping things in sync. Data transfer will cost 
> time and money, and for a big enough cluster (1000 servers can hold 6-12 
> PB of storage) that sync itself takes time. Even at the CERN LHC experiments' 
> data rate of 1 PB/month, it would take 6 months to get the data 
> in to your cluster using a good protocol like GridFTP.
> 
> -single site would make sync easier; 10Gb Ethernet will still take a while 
> but won't cost you extra
> 
>> Sorry, I am asking crazy questions...I am just wanting to learn the meta 
>> issues and opportunities with making clusters.
> 
> Start small, automate everything, worry about scaling up the management 
> problems. The Hadoop filestore and JT scale well, but you have to get your ops 
> right. That's everything from BIOS upgrades to log file management.

James Seigel
ja...@tynt.com
http://www.tynt.com
Captain Hammer



Re: copying file into hdfs

2010-04-10 Thread James Seigel
Maybe copy your hdfs config here and we can see why it took up 16 gigs  
of space.


Cheers

Sent from my mobile. Please excuse the typos.

On 2010-04-10, at 3:22 PM, "Michael Segel"   
wrote:





Mike,

First, you need to see what you set your block size to in Hadoop. By  
default it's 64MB. With large files, you may want to bump that up to  
128MB per block.

A 2GB file will give you roughly 16-32 map tasks, depending on the block size.

I'd use hadoop fs -copyFromLocal <local src> <hdfs dst>.

(Ok, I'm going from memory on the hadoop command, but you can always  
do a hadoop help to see the command.)


Also you need to see what you set for your replication factor.  
Usually it's 3.


Then your 2GB file will be roughly 6GB in size (with replication) and should be  
balanced across all of the nodes with several blocks per machine.


HTH

-Mike
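For reference, a minimal driver-side sketch of the settings discussed above, assuming the 0.20-era Java API; the namenode URI and paths are placeholders, and 128MB / replication 3 are just the example values from this thread:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Bump the block size to 128MB and keep the default replication of 3.
            // These client-side settings apply to files written by this client.
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            // Equivalent of: hadoop fs -copyFromLocal /a.dat /data
            fs.copyFromLocalFile(new Path("/a.dat"), new Path("/data"));
            fs.close();
        }
    }

With a 2GB file at 128MB per block and replication 3, that is 16 blocks and roughly 6GB of raw HDFS usage spread across the 8 datanodes.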


Date: Sat, 10 Apr 2010 14:03:02 -0400
Subject: copying file into hdfs
From: nulleck...@gmail.com
To: common-user@hadoop.apache.org

Hi,

I'm Mike, a new user of Hadoop. Currently I have a cluster of 8 machines and a
file of size 2 gigs.
When I load it into hdfs using the command
hadoop dfs -put /a.dat /data
it actually loads it on all data nodes. dfsadmin -report shows hdfs usage of
16 gigs, and it is taking 2 hours to load that data file.

With 1 node, my mapreduce operation on this data took 150 seconds.

So when I used my mapred operation on this cluster, it took 220
seconds for the same file.

Can someone please tell me how to distribute this file over the 8 nodes, so
that each of them will have roughly 300 MB of file chunks, and the mapreduce
operation that I have written will work in parallel? Isn't a hadoop cluster
supposed to work in parallel?

best.




Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread James Seigel
If you are very adhoc-y, the more bandwidth the merrier!

James

Sent from my mobile. Please excuse the typos.

On 2011-06-28, at 5:03 PM, Matei Zaharia  wrote:

> Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile 
> your target Hadoop workload and see whether it's communication-bound. Hadoop 
> jobs can definitely be communication-bound if you shuffle a lot of data 
> between map and reduce, but I've also seen a lot of clusters that are 
> CPU-bound (due to decompression, running python, or just running expensive 
> user code) or disk-IO-bound. You might be surprised at what your bottleneck 
> is.
>
> Matei
>
> On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:
>
>> Matt,
>> Thanks, this is helpful, I was wondering if you may have some thoughts
>> on the list of other potential benefits of 10GbE NICs for Hadoop
>> (listed in my original e-mail to the list)?
>>
>> regards,
>> Saqib
>>
>> -Original Message-
>> From: Matthew Foley [mailto:ma...@yahoo-inc.com]
>> Sent: Tuesday, June 28, 2011 12:04 PM
>> To: common-user@hadoop.apache.org
>> Cc: Matthew Foley
>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>
>> Hadoop common provides an abstract FileSystem class, and Hadoop applications
>> should be designed to run on that.  HDFS is just one implementation of a
>> valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
>> LocalFileSystem are provided in Hadoop common.  Use of NFS-mounted storage
>> would fall under the LocalFileSystem model.
>>
>> However, one of the core values of Hadoop is the model of "bring the
>> computation to the data".  This does not seem viable with an NFS-based
>> NAS-model storage subsystem.  Thus, while it will "work" for small clusters
>> and small jobs, it is unlikely to scale with high performance to thousands
>> of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
>>
>> --Matt
>>
>>
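Matt's point about the FileSystem abstraction shows up directly in client code; a minimal sketch (the URIs and paths are placeholders, not anything from this thread):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileSystemDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The same application code runs against any Hadoop FileSystem
            // implementation; only the URI scheme changes (hdfs://, file://, s3n://, ...).
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);

            System.out.println(hdfs.exists(new Path("/user/hadoop/input")));
            System.out.println(local.exists(new Path("/tmp")));
        }
    }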
>> On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
>>
>> I see. However, Hadoop is designed to operate best with HDFS because of its
>> inherent striping and blocking strategy - which is tracked by Hadoop.
>> Going outside of that mechanism will probably yield poor results and/or
>> confuse Hadoop.
>>
>> Just my thoughts.
>>
>> On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
>>> Darren,
>>> Thanks, the last pt was basically about 10GbE potentially allowing the
>>> use of a network file system e.g. via NFS as an alternative to HDFS,
>>> the question is there any merit in this. Basically, I was exploring if
>>> the commercial clustered NAS products offer any high-availability or
>>> data management benefits for use with Hadoop?
>>>
>>> Saqib
>>>
>>> -Original Message-
>>> From: Darren Govoni [mailto:dar...@ontrenet.com]
>>> Sent: Tuesday, June 28, 2011 10:21 AM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>>
>>> Hadoop, like other parallel networked computation architectures is I/O
>>> bound, predominantly.
>>> This means any increase in network bandwidth is "A Good Thing" and can
>>> have drastic positive effects on performance. All your points stem
>>> from this simple realization.
>>>
>>> Although I'm confused by your #6. Hadoop already uses a distributed
>>> file system. HDFS.
>>>
>>> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
 Folks,

 I've been digging into the potential benefits of using 10 Gigabit Ethernet
 (10GbE) NIC server connections for Hadoop and wanted to run what I've come up
 with through initial research by the list for 'sanity check' feedback. I'd
 very much appreciate your input on the importance (or lack of it) of the
 following potential benefits of 10GbE server connectivity, as well as other
 thoughts regarding 10GbE and Hadoop. (My interest is specifically in the value
 of 10GbE server connections and 10GbE switching infrastructure, over scenarios
 such as bonded 1GbE server connections with 10GbE switching.)

 1. HDFS Data Loading. The higher throughput enabled by 10GbE server and
 switching infrastructure allows faster processing and distribution of data.

 2. Hadoop Cluster Scalability. High performance for initial data processing
 and distribution directly impacts the degree of parallelism or scalability
 supported by the cluster.

 3. HDFS Replication. Higher-speed server connections allow faster file
 replication.

 4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency
 directly impact the shuffle phase of a data set reduction, especially for
 tasks that are at the document level (including large documents) and lots of
 metadata generated by those documents as well as video analy

Re: Writing out a single file

2011-07-05 Thread James Seigel
Single reducer.

On 2011-07-05, at 9:09 AM, Mark wrote:

> Is there anyway I can write out the results of my mapreduce job into 1 local 
> file... ie the opposite of getmerge?
> 
> Thanks
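For reference, a minimal way to do what James suggests with the 0.20 mapreduce API; MyDriver, MyMapper and MyReducer are placeholders for the job's own classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "single-output-file");
            job.setJarByClass(MyDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);

            // One reducer produces a single part-r-00000 file in the output directory.
            // The alternative is to keep several reducers and run hadoop fs -getmerge
            // on the output directory afterwards.
            job.setNumReduceTasks(1);

            job.waitForCompletion(true);
        }
    }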



Re: Cygwin not working with Hadoop and Eclipse Plugin

2011-07-26 Thread James Seigel
Try using virtual box/vmware and downloading either an image that has hadoop on 
it or a linux image and installing it there.  

Good luck
James.


On 2011-07-26, at 12:33 PM, A Df wrote:

> Dear All:
> 
> I am trying to run Hadoop on Windows 7 so as to test programs before moving 
> to Unix/Linux. I have downloaded the Hadoop 0.20.2 and Eclipse 3.6 because I 
> want to use the plugin. I am also using cygwin. However, I set the 
> environment variable for JAVA_HOME and added the 
> c:\cygwin\bin;c:\cygwin\usr\bin to the PATH variable but I still get the 
> error below when trying to start the Hadoop. This is based on the 
> instructions to edit the file conf/hadoop-env.sh to define at least JAVA_HOME 
> to be the root of your Java installation which I changed to "export 
> JAVA_HOME=/cygdrive/c/Program\ Files\ \(x86\)/Java/jdk1.6.0_26" with no 
> success. I added the \ to escape special characters.
> 
> 
> Error:
> bin/hadoop: line 258: /cygdrive/c/Program: No such file or directory
> 
> 
> I also wanted to find out which is the stable release of Hadoop and which 
> version of Eclipse and the Plugin should I use? So far almost every tutorial 
> I have seen from Googling shows different versions like on:
> http://developer.yahoo.com/hadoop/tutorial/index.html
> OR
> http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
> 
> In Eclipse the WordCount project has 42 errors because it will not recognize 
> the "import org.apache.." in the code.
> 
> 
> I wanted to test on Windows first to get a feel of Hadoop since I am new to 
> it and also because I am newbie Unix/Linux user. I have been trying to follow 
> the tutorials shown at the link above but each time I run into errors with 
> the plugin or not recognizing the import or JAVA_HOME not set. Please can I 
> get some help. Thanks
> 
> Cheers
> A Df



Re: the same key in different reducers

2010-06-09 Thread James Seigel
Oleg,

Are you wanting to have them in different reducers?  If so then you can write a 
custom Partitioner to make that happen.

If you want them to be on the same reducer, then that is what hadoop will do.

:)


On 2010-06-09, at 3:06 PM, Ted Yu wrote:

> Can you disclose more about how K3 is generated?
> From your description below, it is possible.
> 
> On Wed, Jun 9, 2010 at 1:17 AM, Oleg Ruchovets  wrote:
> 
>> Hi ,
>> My hadoop job writes results of map/reduce to HBase.
>> I have 3 reducers.
>> 
>> Here is a sequence of input and output parameters for Mapper , Combiner and
>> Reducer
>>   input: InputFormat<K1, V1>
>>   mapper: Mapper<K1, V1, K2, V2>
>>   combiner: Reducer<K2, V2, K2, V2>
>>   reducer: Reducer<K2, V2, K3, V3>
>>   output: RecordWriter<K3, V3>
>> 
>> My question:
>> Is it possible that more than one reducer has the same output key K3.
>> Meaning in case I have 3 reducers is it possible that
>> reducer1: K3 - 1, V3 [1,2,3]
>> reducer2: K3 - 2, V3 [5,6,9]
>> reducer3: K3 - 1, V3 [10,15,22]
>> 
>> As you can see reducer1 has K3 - 1 and reducer3 has K3 - 1.
>> So is that case possible, or does each and every reducer have a unique output key?
>> 
>> Thanks in advance
>> Oleg.
>> 
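For context, reducer assignment is controlled by the Partitioner over the map output key (K2), not by the reducer's output key (K3), so two reducers can indeed emit the same K3. A sketch of default-style hash partitioning, which can be swapped out with job.setPartitionerClass (this is illustrative, not code from the thread):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Every record with the same map output key K2 lands on the same reducer.
    // Nothing constrains the keys the reducers *emit*, so two reducers can both
    // write records with the same output key K3.
    public class HashLikePartitioner<K2, V2> extends Partitioner<K2, V2> {
        @Override
        public int getPartition(K2 key, V2 value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }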



Re: Hadoop JobTracker Hanging

2010-06-17 Thread James Seigel
Up the memory from the default to about 4x the default (heap setting).  This 
should make it better I’d think!

We’d been having the same issue...I believe this fixed it.

James

On 2010-06-17, at 3:00 PM, Li, Tan wrote:

> Folks,
> 
> I need some help on job tracker.
> I am running a two hadoop clusters (with 30+ nodes) on Ubuntu. One is with 
> version 0.19.1 (apache) and the other one is with version 0.20. 1+169.68 
> (Cloudera).
> 
> I have the same problem with both the clusters: the job tracker hangs almost 
> once a day.
> Symptom: The job tracker web page can not be loaded, the command "hadoop job 
> -list" hangs and jobtracker.log file stops being updated.
> No useful information can I find in the job tracker log file.
> The symptom is gone after I restart the job tracker and the cluster runs fine 
> for another 20+ hour period. And then the symptom comes back.
> 
> I do not have serious problem with HDFS.
> 
> Any ideas about the causes? Any configuration parameter that I can change to 
> reduce the chances of the problem?
> Any tips for diagnosing and troubleshooting?
> 
> Thanks!
> 
> Tan
> 
> 
> 



Re: Why hadoop-u...@lucene.a.o ?

2010-06-18 Thread James Seigel
Great segue!  Is there a definitive guide to indexing with lucene and hadoop 
and serving up the results somehow distributed-like?

Thanks gang
James

Sent from my mobile. Please excuse the typos.

On 2010-06-18, at 3:44 AM, Steve Loughran  wrote:

> Otis Gospodnetic wrote:
>> Hello,
>> 
>> I've noticed people send emails to the following address:
>> 
>>hadoop-u...@lucene.apache.org
>> 
>> Why?
>> Is this supposed to be related to common-user@hadoop.apache.org list?
>> But why would any Hadoop mailing list be @lucene.a.o?
>> 
> 
> Historical: it's where Hadoop began -the indexing infrastructure for lucene.
> 
> hadoop-u...@lucene.a.o and hadoop-u...@hadoop.a.o. both redirect to 
> common-user@hadoop.apache.org, which you can see from the mail headers:
> 
> Reply-To: common-user@hadoop.apache.org


Re: Hadoop JobTracker Hanging

2010-06-21 Thread James Seigel
Good luck Bobby.  I hope that when you get this problem licked you’ll post your 
solutions to help us all learn some more stuff as well :)

Cheers
James.

On 2010-06-21, at 1:49 PM, Bobby Dennett wrote:

> Thanks all for your suggestions (please note that Tan is my co-worker;
> we are both working to try and resolve this issue)... we experienced
> another hang this weekend and increased the HADOOP_HEAPSIZE setting to
> 6000 (MB) as we do periodically see "java.lang.OutOfMemoryError: Java
> heap space" errors in the jobtracker log. We are now looking into the
> resource allocation of the master node/server to ensure we aren't
> experiencing any issues due to the heap size increase. In parallel, we
> are also working on building "beefier" servers -- stronger CPUs, 3x more
> memory -- for the node running the primary namenode and jobtracker
> processes as well as for the secondary namenode.
> 
> Any additional suggestions you might have for troubleshooting/resolving
> this hanging jobtracker issue would be greatly appreciated.
> 
> Please note that I had previously started a similar topic on Get
> Satisfaction
> (http://www.getsatisfaction.com/cloudera/topics/looking_for_troubleshooting_tips_guidance_for_hanging_jobtracker)
> where Todd is helping and the output of jstack and jmap can be found.
> 
> Thanks,
> -Bobby
> 
> On Fri, 18 Jun 2010 15:04 -0600, "Li, Tan"  wrote:
>> Todd,
>> I will try to increase the HADOOP_HEAPSIZE to see if that helps.
>> Tan
>> 
>> -Original Message-
>> From: Todd Lipcon [mailto:t...@cloudera.com] 
>> Sent: Thursday, June 17, 2010 5:07 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Hadoop JobTracker Hanging
>> 
>> Li, just to narrow your search, in my experience this is usually caused
>> by
>> OOME on the JT. Check the logs for OutOfMemoryException, see what you
>> find.
>> You may need to configure it to retain fewer jobs in memory, or up your
>> heap.
>> 
>> -Todd
>> 
>> On Thu, Jun 17, 2010 at 5:03 PM, Li, Tan  wrote:
>> 
>>> Thanks for your tips, Ted.
>>> All of our QA is done on 0.20.1, and I got a feeling it is not version
>>> related.
>>> I will run jstack and jmap once the problem happens again and I may need
>>> your help to analyze the result.
>>> 
>>> Tan
>>> 
>>> -Original Message-
>>> From: Ted Yu [mailto:yuzhih...@gmail.com]
>>> Sent: Thursday, June 17, 2010 2:39 PM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: Hadoop JobTracker Hanging
>>> 
>>> Is upgrading to hadoop-0.20.2+228 possible ?
>>> 
>>> Use jstack to get stack trace of job tracker process when this happens
>>> again.
>>> Use jmap to get shared object memory maps or heap memory details.
>>> 
>>> On Thu, Jun 17, 2010 at 2:00 PM, Li, Tan  wrote:
>>> 
 Folks,
 
 I need some help on job tracker.
 I am running a two hadoop clusters (with 30+ nodes) on Ubuntu. One is
>>> with
 version 0.19.1 (apache) and the other one is with version 0.20. 1+169.68
 (Cloudera).
 
 I have the same problem with both the clusters: the job tracker hangs
 almost once a day.
 Symptom: The job tracker web page can not be loaded, the command "hadoop
 job -list" hangs and jobtracker.log file stops being updated.
 No useful information can I find in the job tracker log file.
 The symptom is gone after I restart the job tracker and the cluster runs
 fine for another 20+ hour period. And then the symptom comes back.
 
 I do not have serious problem with HDFS.
 
 Any ideas about the causes? Any configuration parameter that I can change
 to reduce the chances of the problem?
 Any tips for diagnosing and troubleshooting?
 
 Thanks!
 
 Tan
 
 
 
 
>>> 
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
>> 



Re: Hadoop JobTracker Hanging

2010-06-22 Thread James Seigel
+1 for compressed pointers.  

Sent from my mobile. Please excuse the typos.

On 2010-06-22, at 4:18 AM, Steve Loughran  wrote:

> Bobby Dennett wrote:
>> Thanks all for your suggestions (please note that Tan is my co-worker;
>> we are both working to try and resolve this issue)... we experienced
>> another hang this weekend and increased the HADOOP_HEAPSIZE setting to
>> 6000 (MB) as we do periodically see "java.lang.OutOfMemoryError: Java
>> heap space" errors in the jobtracker log. We are now looking into the
>> resource allocation of the master node/server to ensure we aren't
>> experiencing any issues due to the heap size increase. In parallel, we
>> are also working on building "beefier" servers -- stronger CPUs, 3x more
>> memory -- for the node running the primary namenode and jobtracker
>> processes as well as for the secondary namenode.
>> 
>> Any additional suggestions you might have for troubleshooting/resolving
>> this hanging jobtracker issue would be greatly appreciated.
> 
> Have you tried
>  * using compressed object pointers on java 6 server? They reduce space
> 
>  * bolder: JRockit JVM. Not officially supported in Hadoop, but I liked 
> using right up until oracle stopped giving away the updates with 
> security patches. It has a way better heap as well as compressed 
> pointers for a long time (==more stable code)
> 
> I'm surprised its the JT that is OOM-ing, anecdotally its the NN and 
> 2ary NN that use more, especially if the files are many and the 
> blocksize small. the JT should not be tracking that much data over time


Re: Newbie to HDFS compression

2010-06-24 Thread James Seigel
Cool.  Maybe we should start a page.

J
On 2010-06-24, at 8:16 PM, Harsh J wrote:

> On Fri, Jun 25, 2010 at 2:42 AM, Raymond Jennings III
>  wrote:
>> Oh, maybe that's what I meant :-)  I recall reading something on this mail 
>> group that "the compression" in not included with the hadoop binary and that 
>> you have to get and install it separately due to license incompatibilities.  
>> Looking at the config xml files it's not clear what I need to do.  Thanks.
>> 
> LZO Compression is the one you probably read about. Otherwise
> available CompressionCodecs are BZip2 and GZip, and you should be able
> to use those files just fine.
> 
> Something like FileOutputFormat.setCompressOutput(conf, true);
> 
> (Also look at mapred.output.compress configuration var for
> map-output-compression)
>> 
>> 
>> - Original Message 
>> From: Eric Sammer 
>> To: common-user@hadoop.apache.org
>> Sent: Thu, June 24, 2010 5:09:33 PM
>> Subject: Re: Newbie to HDFS compression
>> 
>> There is no file system level compression in HDFS. You can stored
>> compressed files in HDFS, however.
>> 
>> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
>>  wrote:
>>> Are there instructions on how to enable (which type?) of compression on 
>>> hdfs?  Does this have to be done during installation or can it be added to 
>>> a running cluster?
>>> 
>>> Thanks,
>>> Ray
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Harsh J
> www.harshj.com



Re: Newbie to HDFS compression

2010-06-24 Thread James Seigel
Oops.  Replied to the wrong email.  

Well I should add something useful to the conversation now.

I think LZO has all the right features.  However, not great support in Pig if 
that is what you are using.

It is good to have something splittable.  LZO - check

Compress intermediate files...this is a no brainer.

Stick with it...it is complicated ( a bit )  to install

Cheers
J
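To make those knobs concrete, a driver-side sketch with the 0.20 mapreduce API. GzipCodec is used here only because it ships with Hadoop; the LZO codec discussed above comes from the separately installed hadoop-lzo packages, so its class name is not shown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionSettings {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();

            // Compress the intermediate map output (the "no brainer" above).
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                          GzipCodec.class, CompressionCodec.class);

            Job job = new Job(conf, "compressed-output");

            // Compress the final job output as well; substitute the LZO codec class
            // here once the native LZO libraries are installed, if splittable
            // compressed output is needed.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            return job;
        }
    }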

On 2010-06-24, at 8:45 PM, James Seigel wrote:

> Cool.  Maybe we should start a page.
> 
> J
> On 2010-06-24, at 8:16 PM, Harsh J wrote:
> 
>> On Fri, Jun 25, 2010 at 2:42 AM, Raymond Jennings III
>>  wrote:
>>> Oh, maybe that's what I meant :-)  I recall reading something on this mail 
>>> group that "the compression" in not included with the hadoop binary and 
>>> that you have to get and install it separately due to license 
>>> incompatibilities.  Looking at the config xml files it's not clear what I 
>>> need to do.  Thanks.
>>> 
>> LZO Compression is the one you probably read about. Otherwise
>> available CompressionCodecs are BZip2 and GZip, and you should be able
>> to use those files just fine.
>> 
>> Something like FileOutputFormat.setCompressOutput(conf, true);
>> 
>> (Also look at mapred.output.compress configuration var for
>> map-output-compression)
>>> 
>>> 
>>> - Original Message 
>>> From: Eric Sammer 
>>> To: common-user@hadoop.apache.org
>>> Sent: Thu, June 24, 2010 5:09:33 PM
>>> Subject: Re: Newbie to HDFS compression
>>> 
>>> There is no file system level compression in HDFS. You can stored
>>> compressed files in HDFS, however.
>>> 
>>> On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
>>>  wrote:
>>>> Are there instructions on how to enable (which type?) of compression on 
>>>> hdfs?  Does this have to be done during installation or can it be added to 
>>>> a running cluster?
>>>> 
>>>> Thanks,
>>>> Ray
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Eric Sammer
>>> twitter: esammer
>>> data: www.cloudera.com
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Harsh J
>> www.harshj.com
> 



Re: WritableComparable question

2010-07-19 Thread James Seigel
It has to reuse or the object creation would be killer!

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-07-19, at 3:16 PM, Raymond Jennings III  wrote:

> I am trying to convert some MR programs that use Text only to instead use 
> some 
> custom classes.  One of my classes has a Vector type and I found that the 
> vector 
> grows with each call to my reducer such that the last call to the reducer has 
> every value within the Vector from all of the classes that use this vector 
> type.  Does MR reuse objects?
> 
> The only way I could fix this was to re-initialize my vectors in the "public 
> void readFields(DataInput in)" method.  This does not seem like I should have 
> to 
> do this or do I ???
> 
> Thanks,
> Ray
> 
> 
> 
> 
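A minimal sketch of the pattern Ray arrived at: because the framework reuses the same key/value instances and calls readFields() on them repeatedly, a Writable that wraps a collection has to clear that collection before deserializing the next record, so re-initializing in readFields is the expected fix rather than a hack. The class below is illustrative, not from the thread:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.Vector;
    import org.apache.hadoop.io.Writable;

    public class VectorValue implements Writable {
        private final Vector<Long> values = new Vector<Long>();

        public void write(DataOutput out) throws IOException {
            out.writeInt(values.size());
            for (Long v : values) {
                out.writeLong(v);
            }
        }

        public void readFields(DataInput in) throws IOException {
            values.clear();              // drop state left over from the previous record
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                values.add(in.readLong());
            }
        }
    }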


From X to Hadoop MapReduce

2010-07-20 Thread James Seigel
Hello!

Here is what I have been thinking over the last while.  There are probably a 
number of us that have prototyped stuff in pig or hive and now think that we 
should convert it to some java map reduce code.  

Would anyone be interested in building a “patterns” area somewhere possibly 
with some code to give solid usable examples build some map reduce code without 
having to re-invent the wheel each time?

Example: 

What would be the “best” or “a” way to make the DISTINCT command from pig in a 
java map reduce program?

Let me know if anyone is interested in this.  I’d like to get some sharing 
going.

Cheers
James.




Re: From X to Hadoop MapReduce

2010-07-21 Thread James Seigel
Jeff, I agree that cascading looks cool and might/should have a place in 
everyone’s tool box, however at some corps it takes a while to get those kinds 
of changes in place and therefore they might have to hand craft some java code 
before moving (if they ever can) to a different technology.

I will get something up and going and post a link back for whomever is 
interested.

To answer Himanshu’s question, I am thinking something like this (with some 
code):

Hadoop M/R Patterns, and ones that match Pig Structures

1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same as 
reducer. [Reducer] count = count + next.value.  [Emit] Single result.
2. FREQ COUNT: [Mapper] Item, 1.  [Combiner] Same as reducer. [Reducer] count = 
count + next.value.  [Emit] list of Key, count
3. UNIQUE: [Mapper] Item, One.  [Combiner] None.  [Reducer + Emit] spit out 
list of keys and no value.

I think adding a description of why the technique works would be helpful for 
people learning as well.  I see some questions from people not understanding 
what happens to the data between mappers and reducers, or what data they will 
see when it gets to the reducer...etc...

Cheers
James.
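As a starting point, a minimal sketch of pattern 3 (UNIQUE, pig's DISTINCT) with the 0.20 mapreduce API, assuming the items are plain text lines:

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Distinct {

        // Emit each item as the key with an empty value; the shuffle groups
        // duplicates together on one reducer.
        public static class DistinctMapper
                extends Mapper<Object, Text, Text, NullWritable> {
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, NullWritable.get());
            }
        }

        // Write each distinct key exactly once, with no value.
        public static class DistinctReducer
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                    throws IOException, InterruptedException {
                context.write(key, NullWritable.get());
            }
        }
    }

It works because the shuffle delivers all copies of an item to the same reducer call, which is exactly the kind of between-mapper-and-reducer detail described above.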



Re: From X to Hadoop MapReduce

2010-07-21 Thread James Seigel
Here is a skeleton project I stuffed up on github (feel free to offer other 
suggestions/alternatives).  There is a wiki, a place to commit code, a place to 
fork around, etc..

Over the next couple of days I’ll try and put up some sample samples for people 
to poke around with.  Feel free to attack the wiki, contribute code, etc...

If anyone can derive some cool pseudo code to write map reduce type algorithms 
that’d be great.

Cheers
James.


On 2010-07-21, at 10:51 AM, James Seigel wrote:

> Jeff, I agree that cascading looks cool and might/should have a place in 
> everyone’s tool box, however at some corps it takes a while to get those 
> kinds of changes in place and therefore they might have to hand craft some 
> java code before moving (if they ever can) to a different technology.
> 
> I will get something up and going and post a link back for whomever is 
> interested.
> 
> To answer Himanshu’s question, I am thinking something like this (with some 
> code):
> 
> Hadoop M/R Patterns, and ones that match Pig Structures
> 
> 1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same as 
> reducer. [Reducer] count = count + next.value.  [Emit] Single result.
> 2. FREQ COUNT: [Mapper] Item, 1.  [Combiner] Same as reducer. [Reducer] count 
> = count + next.value.  [Emit] list of Key, count
> 3. UNIQUE: [Mapper] Item, One.  [Combiner] None.  [Reducer + Emit] spit out 
> list of keys and no value.
> 
> I think adding a description of why the technique works would be helpful for 
> people learning as well.  I see some questions from people not understanding 
> what happens to the data between mappers and reducers, or what data they will 
> see when it gets to the reducer...etc...
> 
> Cheers
> James.
> 



Re: From X to Hadoop MapReduce

2010-07-21 Thread James Seigel
Oh yeah, it would help if I put the url: 

http://github.com/seigel/MRPatterns

James

On 2010-07-21, at 2:55 PM, James Seigel wrote:

> Here is a skeleton project I stuffed up on github (feel free to offer other 
> suggestions/alternatives).  There is a wiki, a place to commit code, a place 
> to fork around, etc..
> 
> Over the next couple of days I’ll try and put up some sample samples for 
> people to poke around with.  Feel free to attack the wiki, contribute code, 
> etc...
> 
> If anyone can derive some cool pseudo code to write map reduce type 
> algorithms that’d be great.
> 
> Cheers
> James.
> 
> 
> On 2010-07-21, at 10:51 AM, James Seigel wrote:
> 
>> Jeff, I agree that cascading looks cool and might/should have a place in 
>> everyone’s tool box, however at some corps it takes a while to get those 
>> kinds of changes in place and therefore they might have to hand craft some 
>> java code before moving (if they ever can) to a different technology.
>> 
>> I will get something up and going and post a link back for whomever is 
>> interested.
>> 
>> To answer Himanshu’s question, I am thinking something like this (with some 
>> code):
>> 
>> Hadoop M/R Patterns, and ones that match Pig Structures
>> 
>> 1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same as 
>> reducer. [Reducer] count = count + next.value.  [Emit] Single result.
>> 2. FREQ COUNT: [Mapper] Item, 1.  [Combiner] Same as reducer. [Reducer] 
>> count = count + next.value.  [Emit] list of Key, count
>> 3. UNIQUE: [Mapper] Item, One.  [Combiner] None.  [Reducer + Emit] spit out 
>> list of keys and no value.
>> 
>> I think adding a description of why the technique works would be helpful for 
>> people learning as well.  I see some questions from people not understanding 
>> what happens to the data between mappers and reducers, or what data they 
>> will see when it gets to the reducer...etc...
>> 
>> Cheers
>> James.
>> 
> 



Re: Starting a job on a hadoop cluster remotly

2010-07-28 Thread James Seigel
Not sure exactly what your goals are, but look into SOCKS proxy stuff as well.  You can 
have the hadoop command binary running locally and talking over a socks proxy 
to the actual cluster, without having to have the machines exposed all over the 
place.

Cheers
James.


On 2010-07-28, at 10:42 AM, Michael Sutter wrote:

> 
>  Hey,
> 
> yes it is possible. I'm doing it exactly this way for my implementation 
> from a remote client.
> 
> Implement it like this:
> Configuration conf = new Configuration();
> conf.set("hadoop.job.ugi", "user, group");
> conf.set("namenode.host", somehost.somedomain);
> conf.set("jobtracker.host", somehost.somedomain);
> conf.set("mapred.job.tracker", somehost.somedomain:someport);
> conf.set("fs.default.name", hdfs://somehost.somedomain:someport);
> Job job = new Job(conf, jobname);
> job.setJarByClass(...);
> ...
> 
> Cheers
> Michael
> 
> On 07/28/2010 03:36 PM, Sebastian Ruff (open4business GmbH) wrote:
>> Hey,
>> 
>> 
>> 
>> is it possible to start a job on a hadoop cluster from remote? For example,
>> we have a web application which runs on an apache tomcat server, and we would
>> like to start a mapreduce job on our cluster from within the webapp.
>> 
>> Is this possible? And if yes, what are the steps to get there? Do I just
>> have to put my namenode and datanode in a core-site.xml in the webapp and
>> call the api?
>> 
>> 
>> Thanks a lot,
>> 
>> 
>> 
>> Sebastian
>> 
>> 
>> 
>> 



Re: Basic question

2010-08-25 Thread James Seigel
The output of the reducer is Text/IntWritable. 

To set the "input" to the reducer you set the mapper output classes. 

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-08-25, at 8:13 PM, "Mark"  wrote:

>  job.setOutputKeyClass(Text.class);
> job.setOutputValueClass(IntWritable.class);
> 
> Does this mean the input to the reducer should be Text/IntWritable or 
> the output of the reducer is Text/IntWritable?
> 
> What is the inverse of this.. setInputKeyClass/setInputValueClass? Is 
> this inferred by the JobInputFormatClass? Would someone mind briefly 
> explaining?
> 
> Thanks
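To make James's answer concrete, a driver sketch (the types here are arbitrary examples, not from Mark's job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class OutputTypesDemo {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "output-types");

            // What the reducer finally writes:
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // What the mapper emits, and hence what the reducer receives.
            // There is no setInputKeyClass: the job's input types come from
            // the configured InputFormat.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
        }
    }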


Re: From X to Hadoop MapReduce

2010-09-01 Thread James Seigel
Sounds good!  Please give some examples :)

I just got back from some holidays and will start posting some more stuff 
shortly

Cheers
James.


On 2010-07-21, at 7:22 PM, Jeff Zhang wrote:

> Cool, James. I am very interested to contribute to this.
> I think group by, join and order by can been added to the examples.
> 
> 
> On Thu, Jul 22, 2010 at 4:59 AM, James Seigel  wrote:
> 
>> Oh yeah, it would help if I put the url:
>> 
>> http://github.com/seigel/MRPatterns
>> 
>> James
>> 
>> On 2010-07-21, at 2:55 PM, James Seigel wrote:
>> 
>>> Here is a skeleton project I stuffed up on github (feel free to offer
>> other suggestions/alternatives).  There is a wiki, a place to commit code, a
>> place to fork around, etc..
>>> 
>>> Over the next couple of days I’ll try and put up some sample samples for
>> people to poke around with.  Feel free to attack the wiki, contribute code,
>> etc...
>>> 
>>> If anyone can derive some cool pseudo code to write map reduce type
>> algorithms that’d be great.
>>> 
>>> Cheers
>>> James.
>>> 
>>> 
>>> On 2010-07-21, at 10:51 AM, James Seigel wrote:
>>> 
>>>> Jeff, I agree that cascading looks cool and might/should have a place in
>> everyone’s tool box, however at some corps it takes a while to get those
>> kinds of changes in place and therefore they might have to hand craft some
>> java code before moving (if they ever can) to a different technology.
>>>> 
>>>> I will get something up and going and post a link back for whomever is
>> interested.
>>>> 
>>>> To answer Himanshu’s question, I am thinking something like this (with
>> some code):
>>>> 
>>>> Hadoop M/R Patterns, and ones that match Pig Structures
>>>> 
>>>> 1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same
>> as reducer. [Reducer] count = count + next.value.  [Emit] Single result.
>>>> 2. FREQ COUNT: [Mapper] Item, 1.  [Combiner] Same as reducer. [Reducer]
>> count = count + next.value.  [Emit] list of Key, count
>>>> 3. UNIQUE: [Mapper] Item, One.  [Combiner] None.  [Reducer + Emit] spit
>> out list of keys and no value.
>>>> 
>>>> I think adding a description of why the technique works would be helpful
>> for people learning as well.  I see some questions from people not
>> understanding what happens to the data between mappers and reducers, or what
>> data they will see when it gets to the reducer...etc...
>>>> 
>>>> Cheers
>>>> James.
>>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang



Re: Sorting Numbers using mapreduce

2010-09-06 Thread James Seigel
There is a call to set the sort order as well, by changing the comparator. 

James

Sent from my mobile. Please excuse the typos.

On 2010-09-06, at 12:06 AM, "Owen O'Malley"  wrote:

> The critical item is that your map's output key should be IntWritable
> instead of Text. The default comparator for IntWritable will give you
> properly sorted numbers. If you stringify the numbers and output them
> as text, they'll get sorted as strings.
> 
> -- Owen
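A sketch of the comparator James mentions, for the common case of wanting descending numeric order (illustrative only; register it with job.setSortComparatorClass):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Reverses the default ascending IntWritable order during the sort phase.
    public class DescendingIntComparator extends WritableComparator {
        public DescendingIntComparator() {
            super(IntWritable.class, true);   // true: create key instances for compare()
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
    }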


Re: TOP N items

2010-09-10 Thread James Seigel
Welcome to the land of the fuzzy elephant!

Of course there are many ways to do it.  Here is one, it might not be brilliant 
or the right way, but I am sure you will get more :)

Use the identity mapper...

job.setMapperClass(Mapper.class);

then have one reducer

job.setNumReduceTasks(1);

then have a reducer that has something like this around your reducing code...

Counter counter = context.getCounter("ME", "total output records");
if (counter.getValue() < LIMIT) {
    context.write(key, value);
    counter.increment(1);
}


Cheers
James.



On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:

Hello,

I am new to Hadoop. Can anybody suggest an example or procedure for
outputting the TOP N items having the maximum total count, where the input
file has an (Item, count) pair in each line?

Items can repeat.

Thanks
Neil
http://neilghosh.com

--
Thanks and Regards
Neil
http://neilghosh.com



Re: Custom Key class not working correctly

2010-09-10 Thread James Seigel
Is the footer on this email a little rough for content that will be passed 
around and made indexable on the internets?

Just saying :)

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-09-10, at 8:01 PM, "Kaluskar, Sanjay"  wrote:

> Have you considered using something higher-level like PIG or Hive? Are
> there reasons why you need to process at this low level?
> 
> -Original Message-
> From: Aaron Baff [mailto:aaron.b...@telescope.tv] 
> Sent: Friday, September 10, 2010 11:50 PM
> To: common-user@hadoop.apache.org
> Subject: Custom Key class not working correctly
> 
> So I'm pretty new to Hadoop, just learning it for work, and starting to
> play with some of our data on a VM cluster to see it work, and to make
> sure it can do what we need to. By and large, very cool, I think I'm
> getting the hang of it, but when I try and make a custom composite key
> class, it doesn't seem to group the data correctly.
> 
> The data is a bunch of phone numbers with various transactional data
> (timestamp, phone type, other call data). My Mapper is pretty much just
> taking the data, and splitting it out into a custom Key (or Text with
> just the phone number) and custom Value to hold the rest of the data.
> 
> In my reducer, I'm counting the number of unique phone numbers among
> other things using a Reporter counter. Using my key class (code below),
> I get a total of 56,404 unique numbers which is way too low. When I use
> just the phone number (using Text) as the key, it gives me 1,159,558
> which is correct. In my custom class hashCode() method I'm just using
> the String.hashCode() for the String holding the phone number.
> 
> That seemed reasonable to me, since I wanted it to group the values by
> the phone number, and then order by the timestamp which is what I'm
> doing in the compareTo() function.
> 
> 
> 
> 
> 
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import org.apache.hadoop.io.WritableComparable;
> import org.apache.hadoop.io.WritableComparator;
> 
> public class AIMdnTimeKey implements WritableComparable {
>String mdn = "";
>long timestamp = -1L;
>private byte oli = 0;
> 
>public AIMdnTimeKey() {
>}
> 
>public AIMdnTimeKey( String initMdn, long initTimestamp) {
>mdn = initMdn;
>timestamp = initTimestamp;
>}
> 
>public void setMdn( String newMdn ) {
>mdn = newMdn;
>}
> 
>public String getMdn() {
>return mdn;
>}
> 
>public void setTimestamp( long newTimestamp ) {
>timestamp = newTimestamp;
>}
> 
>public long getTimestamp() {
>return timestamp;
>}
> 
>public void write(DataOutput out) throws IOException {
>out.writeUTF(mdn);
>out.writeByte(oli);
>out.writeLong(timestamp);
>}
> 
>public void readFields(DataInput in) throws IOException {
>mdn = in.readUTF();
>oli = in.readByte();
>timestamp = in.readLong();
>}
> 
>public int compareTo(Object obj) throws ClassCastException {
>if (obj == null) {
>throw new ClassCastException("Object is NULL and so cannot
> be compared!");
>}
>if (getClass() != obj.getClass()) {
>throw new ClassCastException("Object is of type " +
> obj.getClass().getName() + " which cannot be compared to this class of
> type " + getClass().getName());
>}
>final AIMdnTimeKey other = (AIMdnTimeKey) obj;
> 
>return (int)(this.timestamp - other.timestamp);
>}
> 
>@Override
>public int hashCode() {
> 
>return mdn.hashCode();
>}
> 
>@Override
>public boolean equals(Object obj) {
>if (obj == null) {
>return false;
>}
>if (getClass() != obj.getClass()) {
>return false;
>}
>final AIMdnTimeKey other = (AIMdnTimeKey) obj;
>if ((this.mdn == null) ? (other.mdn != null) :
> !this.mdn.equals(other.mdn)) {
>return false;
>}
>return true;
>}
> 
>@Override
>public String toString() {
>return mdn + " " + timestamp;
>}
> 
>/**
> * @return the oli
> */
>public byte getOli() {
>return oli;
>}
> 
>/**
> * @param oli the oli to set
> */
>public void setOli(byte oli) {
>this.oli = oli;
>}
> }
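The thread does not resolve the issue, but for readers hitting the same symptom: the usual composite-key pattern makes compareTo() agree with hashCode() about which records belong together, comparing the grouping field (the phone number) first and the timestamp second. A sketch of such a compareTo for the class above, which also avoids the int overflow in (int)(this.timestamp - other.timestamp); to additionally iterate values in timestamp order per number, a grouping comparator over just the mdn would be set as well:

    public int compareTo(Object obj) throws ClassCastException {
        final AIMdnTimeKey other = (AIMdnTimeKey) obj;
        // Group by phone number first...
        int byMdn = this.mdn.compareTo(other.mdn);
        if (byMdn != 0) {
            return byMdn;
        }
        // ...then order by timestamp, compared without a lossy cast.
        if (this.timestamp == other.timestamp) {
            return 0;
        }
        return this.timestamp < other.timestamp ? -1 : 1;
    }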
> 
> 
> 
> 
> 
> 
> Aaron Baff | Developer | Telescope, Inc.
> 
> email:  aaron.b...@telescope.tv |
> office:  424 270 2913 | www.telescope.tv
> 
> The information contained in this email is confidential and may be
> legally privileged. It is intended solely for the addressee. Access to
> this email by anyone else is unauthorized. If you are not the intended
>

Re: Question on classpath

2010-09-10 Thread James Seigel
Are the libs exploded inside the main jar?  If not then no it probably won't 
work. 

James

Sent from my mobile. Please excuse the typos.

On 2010-09-10, at 7:43 PM, "Mark"  wrote:

>  If I deploy 1 jar (that contains a lib directory with all the required 
> dependencies) shouldn't that jar inherently be distributed to all the 
> nodes?
> 
> On 9/10/10 2:49 PM, Mark wrote:
>> I don't know. I'm running in a fully distributed environment, i.e. not 
>> local or pseudo.
>> 
>> On 9/10/10 12:03 PM, Allen Wittenauer wrote:
>>> On Sep 10, 2010, at 11:53 AM, Mark wrote:
>>> 
 If I submit a jar that has a lib directory that contains a bunch of 
 jars, shouldn't those jars be in the classpath and available to all 
 nodes?
>>> Are you using distributed cache?
>>> 


Re: BUG: Anyone use block size more than 2GB before?

2010-10-18 Thread James Seigel
If there is a hard requirement for input split being one block you could just 
make your input split fit a smaller block size. 

Just saying, in case you can't overcome the 2G ceiling

J

Sent from my mobile. Please excuse the typos.

On 2010-10-18, at 5:08 PM, "elton sky"  wrote:

>> Why would you want to use a block size of > 2GB?
> For keeping a map's input split in a single block~
> 
> On Tue, Oct 19, 2010 at 9:07 AM, Michael Segel 
> wrote:
> 
>> 
>> Ok, I'll bite.
>> Why would you want to use a block size of > 2GB?
>> 
>> 
>> 
>>> Date: Mon, 18 Oct 2010 21:33:34 +1100
>>> Subject: BUG: Anyone use block size more than 2GB before?
>>> From: eltonsky9...@gmail.com
>>> To: common-user@hadoop.apache.org
>>> 
>>> Hello,
>>> 
>>> In
>>> 
>> hdfs.org.apache.hadoop.hdfs.DFSClient
>>> 
>> .DFSOutputStream.writeChunk(byte[]
>>> b, int offset, int len, byte[] checksum)
>>> The second last line:
>>> 
>>> int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize);
>>> 
>>> When I use blockSize  bigger than 2GB, which is out of the boundary of
>>> integer something weird would happen. For example, for a 3GB block it
>> will
>>> create more than 2Million packets.
>>> 
>>> Anyone noticed this before?
>>> 
>>> Elton
>> 
>> 
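A tiny illustration of the truncation Elton is pointing at (the packet size value is just a placeholder):

    public class BlockSizeOverflow {
        public static void main(String[] args) {
            long blockSize = 3L * 1024 * 1024 * 1024;   // 3GB block
            long bytesCurBlock = 0;
            int writePacketSize = 64 * 1024;

            // The cast truncates to 32 bits, so the "bytes remaining in block"
            // value wraps negative and Math.min picks it over the packet size.
            int psize = Math.min((int) (blockSize - bytesCurBlock), writePacketSize);
            System.out.println((int) (blockSize - bytesCurBlock));  // -1073741824
            System.out.println(psize);                              // -1073741824
        }
    }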


Re: Urgent Need: Sr. Developer - Hadoop Hive | Cupertino, CA

2010-10-29 Thread James Seigel
Seems a little light ;)

J


On 2010-10-29, at 3:45 PM, Pablo Cingolani wrote:

> Seriously? 10 years of experience in Hive, Hadoop and MongoDB? :-)
> 
> On Fri, Oct 29, 2010 at 5:38 PM, Ram Prakash
>  wrote:
>> 
>> Job Title: Sr. Developer - Hadoop Hive
>> Location: Cupertino, CA
>> 
>> Relevant Experience (Yrs)   10 +  Yrs
>> 
>> Technical/Functional Skills 10+ years of strong technical and
>> implementation experience in diversified data warehouse technologies like
>> 1. Teradata
>> 2. Hadoop-Hive
>> 3. GreenPlum
>> 4. MongoDB
>> 5. Oracle Coherence
>> 6. TimesTen
>> - Good understanding on pros and cons on data warehousing technologies
>> - Have past experience in evaluating Data Warehousing technologies
>> - Handled large volume of data for processing and reporting
>> - Possess good team leader skills
>> 
>> Roles & ResponsibilitiesTechnical expert in EDW
>> 
>> Please send me resume with contact information.
>> 
>> 
>> Thanks,
>> Ram Prakash
>> E-Solutionsin, Inc
>> ram.prak...@e-solutionsinc.com
>> www.e-solutionsinc.com
>> --
>> View this message in context: 
>> http://old.nabble.com/Urgent-Need%3A-Sr.-Developer---Hadoop-Hive-%7C-Cupertino%2C-CA-tp30088922p30088922.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> 
>> 



Re: Two questions.

2010-11-03 Thread James Seigel
Option 1 = good

Sent from my mobile. Please excuse the typos.

On 2010-11-03, at 8:27 PM, "shangan"  wrote:

> I don't think the first two options can work; even if you stop the tasktracker, 
> these to-be-retired nodes are still connected to the namenode.
> Option 3 can work.  You only need to add this exclude file on the namenode, 
> and it is a regular file. Add a key named dfs.hosts.exclude to your 
> conf/hadoop-site.xml file. The value associated with this key provides the 
> full path to a file on the NameNode's local file system which contains a list 
> of machines which are not permitted to connect to HDFS.
> 
> Then you can run the command bin/hadoop dfsadmin -refreshNodes, and the 
> cluster will decommission the nodes in the exclude file. This might take a 
> period of time as the cluster needs to move data from the retired nodes to 
> the remaining nodes.
> 
> After this you can use these retired nodes as a new cluster.But remember to 
> remove those nodes from the slave nodes file and you can delete the exclude 
> file afterward.
> 
> 
> 2010-11-04 
> 
> 
> 
> shangan
> 
> 
> 
> 
> From: Raj V 
> Sent: 2010-11-04  10:05:44 
> To: common-user 
> Cc: 
> Subject: Two questions. 
> 
> 1. I have a 512 node cluster. I need to have 32 nodes do something else. They 
> can be datanodes but I cannot run any map or reduce jobs on them. So I see 
> three options.
> 1. Stop the tasktracker on those nodes; leave the datanode running.
> 2. Set mapred.tasktracker.reduce.tasks.maximum and 
> mapred.tasktracker.map.tasks.maximum to 0 on these nodes and make these final.
> 3. Use the parameter mapred.hosts.exclude. 
> I am assuming that any of the three methods would work.  To start with, I went 
> with option 3. I used a local file /home/hadoop/myjob.exclude and the file 
> myjob.exclude had the hostname of one host per line (hadoop-480 .. hadoop-511). 
> But I see both map and reduce jobs being scheduled to all the 511 nodes. 
> I understand there is an inherent inefficiency in running only the data node 
> on these 32 nodes.
> Here are my questions.
> 1. Will all three methods work?
> 2. If I choose method 3, does this file exist as a dfs file or a regular file? 
> If a regular file, does it need to exist on all the nodes or only the node 
> where the job is submitted?
> Many thanks in advance.
> Raj


Re: Cluster setup

2010-11-09 Thread James Seigel
You can start with one machine or one VM if you are just looking to try out 
hadoop.  

James.


On 2010-11-09, at 11:47 AM, Fabio A. Miranda wrote:

> hello,
> 
> I am trying to set up a Hadoop cluster. From the docs, it says I need
> two masters (NameNode and JobTracker) and one slave (datanode,
> tasktracker).
> 
> So, do I need at least 4 machines to set up a cluster with hadoop?
> 
> How can I define the role of each machine if core-site.xml needs to be
> the same?
> 
> regards,
> 
> fabio.
> 
> 



Re: Hadoop Certification Progamme

2010-12-15 Thread James Seigel
But it would give you the right creds for people that you’d want to work for :)

James


On 2010-12-15, at 10:26 AM, Konstantin Boudnik wrote:

> Hey, commit rights won't give you a nice looking certificate, would it? ;)
> 
> On Wed, Dec 15, 2010 at 09:12, Steve Loughran  wrote:
>> On 09/12/10 03:40, Matthew John wrote:
>>> 
>>> Hi all,.
>>> 
>>> Is there any valid Hadoop Certification available ? Something which adds
>>> credibility to your Hadoop expertise.
>>> 
>> 
>> Well, there's always providing enough patches to the code to get commit
>> rights :)
>> 



Re: Hadoop File system performance counters

2010-12-15 Thread James Seigel
They represent the amount of data written to the physical disks on the slaves, as 
intermediate files before or during the shuffle phase, whereas the HDFS bytes are 
the files written back into hdfs containing the data you wish to see.

J

On 2010-12-15, at 10:37 AM, abhishek sharma wrote:

> Hi,
> 
> What do the following two File Sytem counters associated with a job
> (and printed at the end of a job's execution) represent?
> 
> FILE_BYTES_READ and FILE_BYTES_WRITTEN
> 
> How are they different from the HDFS_BYTES_READ and HDFS_BYTES_WRITTEN?
> 
> Thanks,
> Abhishek



Re: Hadoop Certification Progamme

2010-12-15 Thread James Seigel
Yes.  Thanks for the correction.  

James
On 2010-12-15, at 10:55 AM, Konstantin Boudnik wrote:

> On Wed, Dec 15, 2010 at 09:28, James Seigel  wrote:
>> But it would give you the right creds for people that you’d want to work for 
>> :)
> 
> I believe you meant to say "you'd want to work _with_" ? 'cause from my
> experience people you work _for_ care more about nice looking
> certificates rather than real creds such as apache commit rights.
> 
>> James
>> 
>> 
>> On 2010-12-15, at 10:26 AM, Konstantin Boudnik wrote:
>> 
>>> Hey, commit rights won't give you a nice looking certificate, would it? ;)
>>> 
>>> On Wed, Dec 15, 2010 at 09:12, Steve Loughran  wrote:
>>>> On 09/12/10 03:40, Matthew John wrote:
>>>>> 
>>>>> Hi all,.
>>>>> 
>>>>> Is there any valid Hadoop Certification available ? Something which adds
>>>>> credibility to your Hadoop expertise.
>>>>> 
>>>> 
>>>> Well, there's always providing enough patches to the code to get commit
>>>> rights :)
>>>> 
>> 
>> 



Re: Please help with hadoop configuration parameter set and get

2010-12-17 Thread James Seigel
If you're wondering where to get a counter from, I'll point you to "context"

Sent from my mobile. Please excuse the typos.

On 2010-12-17, at 7:39 AM, Ted Yu  wrote:

> You can use hadoop counter to pass this information.
> This way, you see the counters in job report.
>
> On Thu, Dec 16, 2010 at 10:58 PM, Peng, Wei  wrote:
>
>> Hi,
>>
>>
>>
>> I am a newbie of hadoop.
>>
>> Today I was struggling with a hadoop problem for several hours.
>>
>>
>>
>> I initialize a parameter by setting job configuration in main.
>>
>> E.g. Configuration con = new Configuration();
>>
>> con.set("test", "1");
>>
>> Job job = new Job(con);
>>
>>
>>
>> Then in the mapper class, I want to set "test" to "2". I did it by
>>
>> context.getConfiguration().set("test","2");
>>
>>
>>
>> Finally in the main method, after the job is finished, I check the
>> "test" again by
>>
>> job.getConfiguration().get("test");
>>
>>
>>
>> However, the value of "test" is still "1".
>>
>>
>>
>> The reason why I want to change the parameter inside Mapper class is
>> that I want to determine when to stop an iteration in the main method.
>> For example, for doing breadth-first search, when there is no new nodes
>> are added for further expansion, the searching iteration should stop.
>>
>>
>>
>> Your help will be deeply appreciated. Thank you
>>
>>
>>
>> Wei
>>
>>
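A sketch of what Ted and James are pointing at: define a custom counter, increment it from the mapper via the context, and read it back in the driver after the job completes. The enum name and mapper body here are placeholders, not Wei's actual code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ExpansionJob {

        // Any non-zero value after the job finishes tells the driver that at
        // least one mapper expanded a node, so another iteration is needed.
        public static enum Search { NODES_EXPANDED }

        public static class BfsMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                boolean expandedSomething = true;   // placeholder for the real test
                if (expandedSomething) {
                    context.getCounter(Search.NODES_EXPANDED).increment(1);
                }
            }
        }

        // Driver side, after job.waitForCompletion(true):
        static boolean shouldIterateAgain(Job job) throws IOException {
            return job.getCounters().findCounter(Search.NODES_EXPANDED).getValue() > 0;
        }
    }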


Re: Please help with hadoop configuration parameter set and get

2010-12-17 Thread James Seigel
Exactly.  It is hard to assume anything about order or coordination between 
maps or reducers.  You should try and design with “sloppy” coordination 
strategies.

James.

On 2010-12-17, at 10:32 AM, Ted Dunning wrote:

> Statics won't work the way you might think because different mappers and
> different reducers are all running in different JVM's.  It might work in
> local mode, but don't kid yourself about it working in a distributed mode.
> It won't.
> 
> On Fri, Dec 17, 2010 at 8:31 AM, Peng, Wei  wrote:
> 
>> Arindam, how to set this global static Boolean variable?
>> I have tried to do something similarly yesterday in the following:
>> Public class BFSearch
>> {
>>   Private static boolean expansion;
>>   Public static class MapperClass {if no nodes expansion = false;}
>>   Public static class ReducerClass
>>   Public static void main {expansion = true; run job;
>> print(expansion)}
>> }
>> In this case, expansion is still true.
>> I will look at hadoop counter and report back here later.
>> 
>> Thank you for all your help
>> Wei
>> 
>> -Original Message-
>> From: Arindam Khaled [mailto:akha...@utdallas.edu]
>> Sent: Friday, December 17, 2010 10:35 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Please help with hadoop configuration parameter set and get
>> 
>> I did something like this using a global static boolean variable
>> (flag) while I was implementing breadth first IDA*. In my case, I set
>> the flag to something else if a solution was found, which was examined
>> in the reducer.
>> 
>> I guess in your case, since you know that if the mappers don't produce
>> anything the reducers won't have anything as input, if I am not wrong.
>> 
>> And I had chaining map-reduce jobs (
>> http://developer.yahoo.com/hadoop/tutorial/module4.html
>> ) running until a solution was found.
>> 
>> 
>> Kind regards,
>> 
>> Arindam Khaled
>> 
>> 
>> 
>> 
>> 
>> On Dec 17, 2010, at 12:58 AM, Peng, Wei wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> I am a newbie of hadoop.
>>> 
>>> Today I was struggling with a hadoop problem for several hours.
>>> 
>>> 
>>> 
>>> I initialize a parameter by setting job configuration in main.
>>> 
>>> E.g. Configuration con = new Configuration();
>>> 
>>> con.set("test", "1");
>>> 
>>> Job job = new Job(con);
>>> 
>>> 
>>> 
>>> Then in the mapper class, I want to set "test" to "2". I did it by
>>> 
>>> context.getConfiguration().set("test","2");
>>> 
>>> 
>>> 
>>> Finally in the main method, after the job is finished, I check the
>>> "test" again by
>>> 
>>> job.getConfiguration().get("test");
>>> 
>>> 
>>> 
>>> However, the value of "test" is still "1".
>>> 
>>> 
>>> 
>>> The reason why I want to change the parameter inside Mapper class is
>>> that I want to determine when to stop an iteration in the main method.
>>> For example, for doing breadth-first search, when there is no new
>>> nodes
>>> are added for further expansion, the searching iteration should stop.
>>> 
>>> 
>>> 
>>> Your help will be deeply appreciated. Thank you
>>> 
>>> 
>>> 
>>> Wei
>>> 
>> 
>> 



Re: Friends of friends with MapReduce

2010-12-19 Thread James Seigel
I may be wrong, but it seems that you are approaching this problem like
you would in the normal programming world, and rightly so.

Depending on the setup of the data (triangle, square or other), you
could do some simple things like spit out the right side of your pair
as the key and the left side as the value.

Then all the pairs with 41 will show up at the same reducer and you
can do your magic there.  Of course this depends on the setup of
your data.

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-12-19, at 7:29 PM, Praveen Bathala  wrote:

> Hi,
>
> I am trying to write MapReduce code to find friends of friends in a social 
> network.
> My data snippet:
>
> 1 41
> 1 7
> 1 100
> 2 64
> 2 65
> 2 42
> 2 86
> 3 54
> 3 24
> 3 16
> 3 43
> 4 39
> 4 52
>
> Here map() goes through line by line; now I get "41" as a friend of "1" and I 
> want to go to all of "41"'s lines to get his friends. How can I do that?
> Can I jump to any line I need in the Mapper class?
>
> Please, Help me with this...
>
> Thanks in advance
> + Praveen
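A sketch of James's suggestion: key each record on the right-hand side of the pair so that every record mentioning the same friend (e.g. 41) reaches one reducer, where the friends-of-friends logic can run over the grouped values. This assumes whitespace-separated "person friend" lines and is illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FofMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] pair = line.toString().trim().split("\\s+");  // "person friend"
            if (pair.length == 2) {
                // friend id becomes the key, person id the value
                context.write(new Text(pair[1]), new Text(pair[0]));
            }
        }
    }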


Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread James Seigel
Thank you for sharing.

Sent from my mobile. Please excuse the typos.

On 2010-12-27, at 11:18 AM, Sudhir Vallamkondu
 wrote:

> We recently crossed this bridge and here are some insights. We did an
> extensive study comparing costs and benchmarking local vs EMR for our
> current needs and future trend.
>
> - The scalability you get with EMR is unmatched, although you need to look at
> your requirements and decide whether this is something you need.
>
> - When using EMR it's cheaper to use reserved instances vs. nodes on the fly.
> You can always add more nodes when required. I suggest looking at your
> current computing needs and reserve instances for a year or two and use
> these to run EMR and add nodes at peak needs. In your cost estimation you
> will need to factor in the data transfer time/costs unless you are dealing
> with public datasets on S3
>
> - EMR fared similar to local cluster on CPU benchmarks (we used MRBench to
> benchmark map/reduce) however IO benchmarks were slow on EMR (used DFSIO
> benchmark). For IO intensive jobs you will need to add more nodes to
> compensate this.
>
> - When compared to a local cluster, you will need to factor in the time it takes
> for the EMR cluster to set up when starting a job. This includes data transfer
> time, cluster replication time, etc.
>
> - EMR API is very flexible however you will need to build a custom interface
> on top of it to suit your job management and monitoring needs
>
> - EMR bootstrap actions can satisfy most of your native lib needs so no
> drawbacks there.
>
>
> -- Sudhir
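
For reference, the benchmarks mentioned above ship with the Hadoop 0.20 test
jar; a hedged sketch of how they are typically invoked (the jar name and sizes
are placeholders -- adjust to your release):

# MapReduce benchmark (MRBench): 50 small job runs
hadoop jar $HADOOP_HOME/hadoop-*-test.jar mrbench -numRuns 50

# HDFS I/O benchmark (TestDFSIO): write, then read, 10 x 1 GB files
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
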
>
>
> On 12/26/10 5:26 AM, "common-user-digest-h...@hadoop.apache.org"
>  wrote:
>
>> From: Otis Gospodnetic 
>> Date: Fri, 24 Dec 2010 04:41:46 -0800 (PST)
>> To: 
>> Subject: Re: Hadoop/Elastic MR on AWS
>>
>> Hello Amandeep,
>>
>>
>>
>> - Original Message 
>>> From: Amandeep Khurana 
>>> To: common-user@hadoop.apache.org
>>> Sent: Fri, December 10, 2010 1:14:45 AM
>>> Subject: Re: Hadoop/Elastic MR on AWS
>>>
>>> Mark,
>>>
>>> Using EMR makes it very easy to start a cluster and add/reduce  capacity as
>>> and when required. There are certain optimizations that make EMR  an
>>> attractive choice as compared to building your own cluster out. Using  EMR
>>
>>
>> Could you please point out what optimizations you are referring to?
>>
>> Thanks,
>> Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>> Hadoop ecosystem search :: http://search-hadoop.com/
>>
>>> also ensures you are using a production quality, stable system backed by  
>>> the
>>> EMR engineers. You can always use bootstrap actions to put your own  tweaked
>>> version of Hadoop in there if you want to do that.
>>>
>>> Also, you  don't have to tear down your cluster after every job. You can set
>>> the alive  option when you start your cluster and it will stay there even
>>> after your  Hadoop job completes.
>>>
>>> If you face any issues with EMR, send me a mail  offline and I'll be happy 
>>> to
>>> help.
>>>
>>> -Amandeep
>>>
>>>
>>> On Thu, Dec 9,  2010 at 9:47 PM, Mark   wrote:
>>>
 Does anyone have any thoughts/experiences on running Hadoop  in AWS? What
 are some pros/cons?

 Are there any good  AMI's out there for this?

 Thanks for any advice.

>>>
>
>
> iCrossing Privileged and Confidential Information
> This email message is for the sole use of the intended recipient(s) and may 
> contain confidential and privileged information of iCrossing. Any 
> unauthorized review, use, disclosure or distribution is prohibited. If you 
> are not the intended recipient, please contact the sender by reply email and 
> destroy all copies of the original message.
>
>


Re: UI doesn't work

2010-12-27 Thread James Seigel
Two quick questions first.

Is the job tracker running on that machine?
Is there a firewall in the way?

James

Sent from my mobile. Please excuse the typos.

On 2010-12-27, at 4:46 PM, maha  wrote:

> Hi,
>
>  I get Error 404 when I try to use hadoop UI to monitor my job execution. I'm 
> using Hadoop-0.20.2 and the following are parts of my configuration files.
>
> in Core-site.xml:
>fs.default.name
>hdfs://speed.cs.ucsb.edu:9000
>
> in mapred-site.xml:
>mapred.job.tracker
>speed.cs.ucsb.edu:9001
>
>
> when I try to open:  http://speed.cs.ucsb.edu:50070/   I get the 404 Error.
>
>
> Any ideas?
>
>  Thank you,
> Maha
>
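
A quick way to answer both questions from the shell (an editorial sketch; the
host and ports are the ones from the configuration quoted above):

# 1) Are the daemons actually running on speed.cs.ucsb.edu?
jps          # should list NameNode, JobTracker, DataNode, TaskTracker, ...

# 2) Are the web UIs listening, and is anything blocking them?
netstat -tlnp | grep -E '50070|50030'
curl -I http://speed.cs.ucsb.edu:50070/dfshealth.jsp    # NameNode UI
curl -I http://speed.cs.ucsb.edu:50030/jobtracker.jsp   # JobTracker UI
/sbin/iptables -L -n    # check for a local firewall rule blocking the ports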


Re: ClassNotFoundException

2010-12-28 Thread James Seigel
Run jar -tvf on the jar file and double-check that the class is
listed. It can't be inside an included (nested) jar file.

Sent from my mobile. Please excuse the typos.
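
A hedged illustration of both checks (the class name is the one from the error
below; the driver lines assume the 0.20 new API):

# the class must appear in the top-level jar, not inside a nested/included jar
jar -tvf /home/userme/hd.jar | grep 'org/postdirekt/hadoop/Map.class'

// and in the job driver, make sure the submitted jar is the one containing it:
job.setJarByClass(org.postdirekt.hadoop.WordCount.class);
job.setMapperClass(org.postdirekt.hadoop.Map.class);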

On 2010-12-28, at 7:58 AM, "Cavus,M.,Fa. Post Direkt"
 wrote:

> Hi,
>
> I process this command: ./hadoop jar /home/userme/hd.jar
> org.postdirekt.hadoop.WordCount gutenberg gutenberberg-output
>
>
>
> and get this. Why? I do have org.postdirekt.hadoop.Map in the jar
> file.
>
>
>
> 10/12/28 15:28:30 INFO mapreduce.Job: Task Id :
> attempt_201012281524_0002_m_00_0, Status : FAILED
>
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> org.postdirekt.hadoop.Map
>
>at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
>
>at
> org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContex
> tImpl.java:167)
>
>at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:612)
>
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
>
>at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:396)
>
>at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
> n.java:742)
>
>at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> Caused by: java.lang.ClassNotFoundException: org.postdirekt.hadoop.Map
>
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at
> java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>
>at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>
>at sun.m
>
> 10/12/28 15:28:41 INFO mapreduce.Job: Task Id :
> attempt_201012281524_0002_m_00_1, Status : FAILED
>
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> org.postdirekt.hadoop.Map
>
>at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
>
>at
> org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContex
> tImpl.java:167)
>
>at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:612)
>
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
>
>at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:396)
>
>at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
> n.java:742)
>
>at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> Caused by: java.lang.ClassNotFoundException: org.postdirekt.hadoop.Map
>
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at
> java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>
>at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>
>at sun.m
>
> 10/12/28 15:28:53 INFO mapreduce.Job: Task Id :
> attempt_201012281524_0002_m_00_2, Status : FAILED
>
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> org.postdirekt.hadoop.Map
>
>at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
>
>at
> org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContex
> tImpl.java:167)
>
>at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:612)
>
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
>
>at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:396)
>
>at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
> n.java:742)
>
>at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> Caused by: java.lang.ClassNotFoundException: org.postdirekt.hadoop.Map
>
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at
> java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>
>at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>
>at sun.m
>
> 10/12/28 15:29:09 INFO mapreduce.Job: Job complete:
> job_201012281524_0002
>
> 10/12/28 15:29:09 INFO mapreduce.Job: Counters: 7
>
>Job Counters
>
>Data-local map tasks=4
>
>Total time spent by all maps waiting after
> reserving slots (ms)=0
>
>Total time spent by all reduces waiting after
> reserving slots (ms)=0
>
>Failed map tasks=1
>
>SLOT

Re: UI doesn't work

2010-12-28 Thread James Seigel
For job tracker go to port 50030 see if that helps

James

Sent from my mobile. Please excuse the typos.

On 2010-12-28, at 1:36 PM, maha  wrote:

> James said:
>
> Is the job tracker running on that machine?YES
> Is there a firewall in the way?  I don't think so, because it used to work 
> for me. How can I check that?
>
> 
> Harsh said:
>
> Did you do any ant operation on your release copy of Hadoop prior to
> starting it, by the way?
>
>   NO, I get the following error:
>
> BUILD FAILED
> /cs/sandbox/student/maha/hadoop-0.20.2/build.xml:316: Unable to find a javac 
> compiler;
> com.sun.tools.javac.Main is not on the classpath.
> Perhaps JAVA_HOME does not point to the JDK.
> It is currently set to "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/jre"
>
>  I had to change JAVA_HOME to point to --> /usr/lib/jvm/jre-1.6.0-openjdk   
> because I used to get an error when trying to run a jar file. The error was:
>
>> bin/hadoop: line 258: /etc/alternatives/java/bin/java: Not a directory
>> bin/hadoop: line 289: /etc/alternatives/java/bin/java: Not a directory
>> bin/hadoop: line 289: exec: /etc/alternatives/java/bin/java: cannot
>> execute: Not a directory
>
>
> 
> Adarsh said:
>
>  logs of namenode + jobtracker
>
> < namenode log 
>
> [m...@speed logs]$ cat hadoop-maha-namenode-speed.cs.ucsb.edu.log
> 2010-12-28 12:23:25,006 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> STARTUP_MSG:
> /
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = speed.cs.ucsb.edu/128.111.43.50
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.2
> STARTUP_MSG:   build = 
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 
> 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
> /
> 2010-12-28 12:23:25,126 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
> Initializing RPC Metrics with hostName=NameNode, port=9000
> 2010-12-28 12:23:25,130 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Namenode up at: speed.cs.ucsb.edu/128.111.43.50:9000
> 2010-12-28 12:23:25,133 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2010-12-28 12:23:25,134 INFO 
> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing 
> NameNodeMeterics using context 
> object:org.apache.hadoop.metrics.spi.NullContext
> 2010-12-28 12:23:25,258 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=maha,grad
> 2010-12-28 12:23:25,258 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-12-28 12:23:25,258 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
> 2010-12-28 12:23:25,269 INFO 
> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: 
> Initializing FSNamesystemMetrics using context 
> object:org.apache.hadoop.metrics.spi.NullContext
> 2010-12-28 12:23:25,270 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered 
> FSNamesystemStatusMBean
> 2010-12-28 12:23:25,316 INFO org.apache.hadoop.hdfs.server.common.Storage: 
> Number of files = 6
> 2010-12-28 12:23:25,323 INFO org.apache.hadoop.hdfs.server.common.Storage: 
> Number of files under construction = 0
> 2010-12-28 12:23:25,323 INFO org.apache.hadoop.hdfs.server.common.Storage: 
> Image file of size 551 loaded in 0 seconds.
> 2010-12-28 12:23:25,323 INFO org.apache.hadoop.hdfs.server.common.Storage: 
> Edits file /tmp/hadoop-maha/dfs/name/current/edits of size 4 edits # 0 loaded 
> in 0 seconds.
> 2010-12-28 12:23:25,358 INFO org.apache.hadoop.hdfs.server.common.Storage: 
> Image file of size 551 saved in 0 seconds.
> 2010-12-28 12:23:25,711 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage 
> in 542 msecs
> 2010-12-28 12:23:25,715 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe 
> mode ON.
> The ratio of reported blocks 0. has not reached the threshold 0.9990. 
> Safe mode will be turned off automatically.
> 2010-12-28 12:23:25,834 INFO org.mortbay.log: Logging to 
> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via 
> org.mortbay.log.Slf4jLog
> 2010-12-28 12:23:25,901 INFO org.apache.hadoop.http.HttpServer: Port returned 
> by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening 
> the listener on 50070
> 2010-12-28 12:23:25,902 INFO org.apache.hadoop.http.HttpServer: 
> listener.getLocalPort() returned 50070 
> webServer.getConnectors()[0].getLocalPort() returned 50070
> 2010-12-28 12:23:25,902 INFO org.apache.hadoop.http.HttpServer: Jetty bound 
> to port 50070
> 2010-12-28 12:23:25

Re: UI doesn't work

2010-12-28 Thread James Seigel
Nope, just on my iPhone I thought you'd tried a different port ( bad memory :) )

Try accessing it with the IP address you get from running ifconfig (or
ipconfig on Windows) on the machine.

Then look at the logs and see if there are any errors or indications
that it is being hit properly.

Does your browser follow redirects properly?  As well try clearing the
cache on your browser.

Sorry for checking out the obvious stuff but sometimes it is :).

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-12-28, at 2:30 PM, maha  wrote:

> Hi James,
>
>   I'm accessing  ---> http://speed.cs.ucsb.edu:50030/   for the job tracker 
> and  port: 50070 for the name node just like Hadoop quick start.
>
> Did you mean to change the port in my mapred-site.xml file ?
>
>  
>mapred.job.tracker
>speed.cs.ucsb.edu:9001
>  
>
>
> Maha
>
>
> On Dec 28, 2010, at 1:01 PM, James Seigel wrote:
>
>> For job tracker go to port 50030 see if that helps
>>
>> James
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2010-12-28, at 1:36 PM, maha  wrote:
>>
>>> James said:
>>>
>>> Is the job tracker running on that machine?YES
>>> Is there a firewall in the way?  I don't think so, because it used to work 
>>> for me. How can I check that?
>>>
>>> 
>>> Harsh said:
>>>
>>> Did you do any ant operation on your release copy of Hadoop prior to
>>> starting it, by the way?
>>>
>>> NO, I get the following error:
>>>
>>> BUILD FAILED
>>> /cs/sandbox/student/maha/hadoop-0.20.2/build.xml:316: Unable to find a 
>>> javac compiler;
>>> com.sun.tools.javac.Main is not on the classpath.
>>> Perhaps JAVA_HOME does not point to the JDK.
>>> It is currently set to "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/jre"
>>>
>>> I had to change JAVA_HOME to point to --> /usr/lib/jvm/jre-1.6.0-openjdk   
>>> because I used to get an error when trying to run a jar file. The error was:
>>>
>>>> bin/hadoop: line 258: /etc/alternatives/java/bin/java: Not a directory
>>>> bin/hadoop: line 289: /etc/alternatives/java/bin/java: Not a directory
>>>> bin/hadoop: line 289: exec: /etc/alternatives/java/bin/java: cannot
>>>> execute: Not a directory
>>>
>>>
>>> 
>>> Adarsh said:
>>>
>>> logs of namenode + jobtracker
>>>
>>> <<<<< namenode log >>>>
>>>
>>> [m...@speed logs]$ cat hadoop-maha-namenode-speed.cs.ucsb.edu.log
>>> 2010-12-28 12:23:25,006 INFO 
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
>>> /
>>> STARTUP_MSG: Starting NameNode
>>> STARTUP_MSG:   host = speed.cs.ucsb.edu/128.111.43.50
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 0.20.2
>>> STARTUP_MSG:   build = 
>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 
>>> 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
>>> /
>>> 2010-12-28 12:23:25,126 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
>>> Initializing RPC Metrics with hostName=NameNode, port=9000
>>> 2010-12-28 12:23:25,130 INFO 
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: 
>>> speed.cs.ucsb.edu/128.111.43.50:9000
>>> 2010-12-28 12:23:25,133 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
>>> Initializing JVM Metrics with processName=NameNode, sessionId=null
>>> 2010-12-28 12:23:25,134 INFO 
>>> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: 
>>> Initializing NameNodeMeterics using context 
>>> object:org.apache.hadoop.metrics.spi.NullContext
>>> 2010-12-28 12:23:25,258 INFO 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=maha,grad
>>> 2010-12-28 12:23:25,258 INFO 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
>>> 2010-12-28 12:23:25,258 INFO 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
>>> isPermissionEnabled=true
>>> 2010-12-28 12:23:25,269 INFO 
>>> org.apache.hadoop.hdfs.server.nameno

Re: help for using mapreduce to run different code?

2010-12-28 Thread James Seigel
Not sure what you mean.

Can you write custom code for your map functions?: yes

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-12-28, at 3:54 PM, Jander g  wrote:

> Hi, all
>
> Does Hadoop support map functions running different code? If yes, how
> can this be realized?
>
> Thanks in advance!
>
> --
> Regards,
> Jander


Re: help for using mapreduce to run different code?

2010-12-28 Thread James Seigel
Simple answer is:  submit two jobs.

Hadoop is designed to run many tasks simultaneously.

Cheers
J

Sent from my mobile. Please excuse the typos.

On 2010-12-28, at 4:09 PM, Jander g  wrote:

> Hi James,
>
> Thanks for your attention.
>
> Suppose there are only 2 maps running in the Hadoop cluster; I want to use one
> map to sort and another to do wordcount at the same time in the same Hadoop
> cluster.
>
> On Wed, Dec 29, 2010 at 6:58 AM, James Seigel  wrote:
>
>> Not sure what you mean.
>>
>> Can you write custom code for your map functions?: yes
>>
>> Cheers
>> James
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2010-12-28, at 3:54 PM, Jander g  wrote:
>>
>>> Hi, all
>>>
>>> Whether Hadoop supports the map function running different code? If yes,
>> how
>>> to realize this?
>>>
>>> Thanks in advance!
>>>
>>> --
>>> Regards,
>>> Jander
>>
>
>
>
> --
> Thanks,
> Jander
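
A minimal sketch of that suggestion (illustrative; assumes the 0.20 new API and
that the cluster has enough task slots to run both jobs side by side): submit
both jobs without blocking, then wait for each to finish.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TwoJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job sortJob = new Job(conf, "sort");
        // ... set mapper/reducer/input/output for the sort job ...
        Job wordCountJob = new Job(conf, "wordcount");
        // ... set mapper/reducer/input/output for the wordcount job ...

        sortJob.submit();        // returns immediately
        wordCountJob.submit();   // both jobs now run concurrently

        // block until both have finished
        while (!sortJob.isComplete() || !wordCountJob.isComplete()) {
            Thread.sleep(5000);
        }
    }
}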


Re: Retrying connect to server

2010-12-30 Thread James Seigel
Or
3) The configuration (or lack thereof) on the machine you are trying to 
run this from has no idea where your DFS or JobTracker is :)

Cheers
James.

On 2010-12-30, at 8:53 PM, Adarsh Sharma wrote:

> Cavus,M.,Fa. Post Direkt wrote:
>> I process this
>> 
>> ./hadoop jar ../../hadoopjar/hd.jar org.postdirekt.hadoop.WordCount 
>> gutenberg gutenberg-output
>> 
>> I get this
>> Does anyone know why I get this error?
>> 
>> 10/12/30 16:48:59 INFO security.Groups: Group mapping 
>> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; 
>> cacheTimeout=30
>> 10/12/30 16:49:01 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 0 time(s).
>> 10/12/30 16:49:02 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 1 time(s).
>> 10/12/30 16:49:03 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 2 time(s).
>> 10/12/30 16:49:04 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 3 time(s).
>> 10/12/30 16:49:05 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 4 time(s).
>> 10/12/30 16:49:06 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 5 time(s).
>> 10/12/30 16:49:07 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 6 time(s).
>> 10/12/30 16:49:08 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 7 time(s).
>> 10/12/30 16:49:09 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 8 time(s).
>> 10/12/30 16:49:10 INFO ipc.Client: Retrying connect to server: 
>> localhost/127.0.0.1:9001. Already tried 9 time(s).
>> Exception in thread "main" java.net.ConnectException: Call to 
>> localhost/127.0.0.1:9001 failed on connection exception: 
>> java.net.ConnectException: Connection refused
>>  at org.apache.hadoop.ipc.Client.wrapException(Client.java:932)
>>  at org.apache.hadoop.ipc.Client.call(Client.java:908)
>>  at 
>> org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
>>  at $Proxy0.getProtocolVersion(Unknown Source)
>>  at 
>> org.apache.hadoop.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:228)
>>  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:224)
>>  at org.apache.hadoop.mapreduce.Cluster.createRPCProxy(Cluster.java:82)
>>  at org.apache.hadoop.mapreduce.Cluster.createClient(Cluster.java:94)
>>  at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:70)
>>  at org.apache.hadoop.mapreduce.Job.<init>(Job.java:129)
>>  at org.apache.hadoop.mapreduce.Job.<init>(Job.java:134)
>>  at org.postdirekt.hadoop.WordCount.main(WordCount.java:19)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>  at java.lang.reflect.Method.invoke(Method.java:597)
>>  at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
>> Caused by: java.net.ConnectException: Connection refused
>>  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>  at 
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>  at 
>> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
>>  at 
>> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:417)
>>  at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:207)
>>  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1025)
>>  at org.apache.hadoop.ipc.Client.call(Client.java:885)
>>  ... 15 more
>>  
> This is the most common issue occured after configuring Hadoop Cluster.
> 
> Reason :
> 
> 1. Your NameNode or JobTracker is not running. Verify through the Web UI and jps 
> commands.
> 2. DNS resolution. You must have IP-to-hostname entries for all nodes in the 
> /etc/hosts file.
> 
> 
> 
> Best Regards
> 
> Adarsh Sharma
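
For point 3, the usual culprit is a client machine whose configuration still
points at localhost; a hedged sketch of the two entries that matter (the
hostname is a placeholder for wherever the NameNode/JobTracker actually run):

<!-- core-site.xml on the machine submitting the job -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master-host:9000</value>
</property>

<!-- mapred-site.xml on the same machine -->
<property>
  <name>mapred.job.tracker</name>
  <value>master-host:9001</value>
</property>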



Re: Help for the problem of running lucene on Hadoop

2010-12-31 Thread James Seigel
Check out katta for an example

J

Sent from my mobile. Please excuse the typos.

On 2010-12-31, at 4:47 PM, Lance Norskog  wrote:

> This will not work for indexing. Lucene requires random read/write to
> a file and HDFS does not support this. HDFS only allows sequential
> writes: you start at the beginning and copy the file into block 0,
> block 1,...block N.
>
> For querying, if your HDFS implementation makes a local cache that
> appears as a file system (I think FUSE does this?) it might work well.
> But, yes, you should copy it down.
>
> On Fri, Dec 31, 2010 at 4:43 AM, Zhou, Yunqing  wrote:
>> You should implement the Directory class by your self.
>> Nutch provided one, named HDFSDirectory.
>> You can use it to build the index, but when doing search on HDFS, it is
>> relatively slower, especially on phrase queries.
>> I recommend you to download it to disk when performing a search.
>>
>> On Fri, Dec 31, 2010 at 5:08 PM, Jander g  wrote:
>>
>>> Hi, all
>>>
>>> I want  to run lucene on Hadoop, The problem as follows:
>>>
>>> IndexWriter writer = new IndexWriter(FSDirectory.open(new
>>> File("index")),new StandardAnalyzer(), true,
>>> IndexWriter.MaxFieldLength.LIMITED);
>>>
>>> when using Hadoop, whether the first param must be the dir of HDFS? And how
>>> to use?
>>>
>>> Thanks in advance!
>>>
>>> --
>>> Regards,
>>> Jander
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com


Re: Help for the problem of running lucene on Hadoop

2011-01-01 Thread James Seigel
Well.   Depending on the size of your cluster it could be, or it might not.

If you have 100 machines with 8 maps each trying to connect to MySQL,
you might get some hilarity.

If you have three machines you won't knock your MySQL instance over.

Sent from my mobile. Please excuse the typos.

On 2010-12-31, at 10:56 PM, Jander g  wrote:

> Thanks for all the above reply.
>
> Now my idea is: run word segmentation on Hadoop and create the
> inverted index in MySQL. As we know, Hadoop MR supports writing to and reading
> from MySQL.
>
> Does this approach have any problems?
>
> On Sat, Jan 1, 2011 at 7:49 AM, James Seigel  wrote:
>
>> Check out katta for an example
>>
>> J
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2010-12-31, at 4:47 PM, Lance Norskog  wrote:
>>
>>> This will not work for indexing. Lucene requires random read/write to
>>> a file and HDFS does not support this. HDFS only allows sequential
>>> writes: you start at the beginninig and copy the file in to block 0,
>>> block 1,...block N.
>>>
>>> For querying, if your HDFS implementation makes a local cache that
>>> appears as a file system (I think FUSE does this?) it might work well.
>>> But, yes, you should copy it down.
>>>
>>> On Fri, Dec 31, 2010 at 4:43 AM, Zhou, Yunqing 
>> wrote:
>>>> You should implement the Directory class by your self.
>>>> Nutch provided one, named HDFSDirectory.
>>>> You can use it to build the index, but when doing search on HDFS, it is
>>>> relatively slower, especially on phrase queries.
>>>> I recommend you to download it to disk when performing a search.
>>>>
>>>> On Fri, Dec 31, 2010 at 5:08 PM, Jander g  wrote:
>>>>
>>>>> Hi, all
>>>>>
>>>>> I want  to run lucene on Hadoop, The problem as follows:
>>>>>
>>>>> IndexWriter writer = new IndexWriter(FSDirectory.open(new
>>>>> File("index")),new StandardAnalyzer(), true,
>>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>>>
>>>>> when using Hadoop, whether the first param must be the dir of HDFS? And
>> how
>>>>> to use?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Jander
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>
>
>
>
> --
> Thanks,
> Jander
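
On writing the inverted index to MySQL: a hedged sketch of the DBOutputFormat
wiring being discussed (table/column names and the JDBC URL are placeholders;
the reduce output key must implement DBWritable, and -- as noted above -- keep
the number of concurrent reducers small enough for the database). Depending on
the release, these classes live under org.apache.hadoop.mapreduce.lib.db (shown
here) or the older org.apache.hadoop.mapred.lib.db package.

// fragment from a job driver
Configuration conf = new Configuration();
DBConfiguration.configureDB(conf,
    "com.mysql.jdbc.Driver",                        // JDBC driver on the task classpath
    "jdbc:mysql://dbhost:3306/search", "user", "password");

Job job = new Job(conf, "inverted-index-to-mysql");
job.setOutputFormatClass(DBOutputFormat.class);
// table "postings" with columns (term, doc_id, freq) -- placeholders
DBOutputFormat.setOutput(job, "postings", "term", "doc_id", "freq");
job.setNumReduceTasks(4);   // throttle concurrent database connections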


Re: When does Reduce job start

2011-01-04 Thread James Seigel
As the other gentleman said. The reduce task kinda needs to know all
the data is available before doing its work.

By design.

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2011-01-04, at 6:14 PM, sagar naik  wrote:

> Hi Jeff,
>
> To be clear on my end, I'm not talking about the reduce() function call but
> the spawning of the reduce process/task itself.
> To rephrase:
>   The reduce process/task is not started until 90% of the map tasks are done
>
>
> -Sagar
> On Tue, Jan 4, 2011 at 3:14 PM, Jeff Bean  wrote:
>> It's part of the design that reduce() does not get called until the map
>> phase is complete. You're seeing reduce report as started when map is at 90%
>> complete because hadoop is shuffling data from the mappers that have
>> completed. As currently designed, you can't prematurely start reduce()
>> because there is no way to gaurantee you have all the values for any key
>> until all the mappers are done. reduce() requires a key and all the values
>> for that key in order to execute.
>>
>> Jeff
>>
>>
>> On Tue, Jan 4, 2011 at 10:53 AM, sagar naik  wrote:
>>
>>> Hi All,
>>>
>>> number of map tasks: 1000s
>>> number of reduce tasks: single digit
>>>
>>> In such cases the reduce tasks won't start even when a few map tasks are
>>> completed.
>>> Example:
>>> In my observation of a sample run of bin/hadoop jar
>>> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
>>> of the map tasks were complete.
>>>
>>> The only reason I can think of for not starting a reduce task early is to
>>> avoid the unnecessary transfer of map output data in case of
>>> failures.
>>>
>>>
>>> Is there a way to quickly start the reduce tasks in such a case?
>>> What is the configuration param to change this behavior?
>>>
>>>
>>>
>>> Thanks,
>>> Sagar
>>>
>>
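
For the specific knob being asked about: in 0.20 the launch point of reducers is
controlled by mapred.reduce.slowstart.completed.maps, the fraction of map tasks
that must complete before reduce tasks are scheduled (they then start shuffling
early; reduce() itself still waits for all map output, as explained above). A
hedged example:

<!-- mapred-site.xml, or -D on the job: launch reducers once 5% of maps are done -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>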


Re: Can MapReduce run simultaneous producer/consumer processes?

2011-01-06 Thread James Seigel
Not sure if this would work or is the right approach, but by looking into Hadoop 
Streaming you ?might? find something.

Cheers
James.
 
On 2011-01-06, at 3:27 PM, W.P. McNeill wrote:

> Say I have two MapReduce processes, A and B.  The two are algorithmically
> dissimilar, so they have to be implemented as separate MapReduce processes.
> The output of A is used as the input of B, so A has to run first.  However,
> B doesn't need to take all of A's output as input, only a partition of it.
> So in theory A and B could run at the same time in a producer/consumer
> arrangement, where B would start to work as soon as A had produced some
> output but before A had completed.  Obviously, this could be a big
> parallelization win.
> 
> Is this possible in MapReduce?  I know at the most basic level it is
> not–there is no synchronization mechanism that allows the same HDFS
> directory to be used for both input and output–but is there some abstraction
> layer on top that allows it?  I've been digging around, and I think the
> answer is "No" but I want to be sure.
> 
> More specifically, the only abstraction layer I'm aware of that chains
> together MapReduce processes is Cascade, and I think it requires the reduce
> steps to be serialized, but again I'm not sure because I've only read the
> documentation and haven't actually played with it.



Re: Distributed indexing with Hadoop

2011-01-29 Thread James Seigel
Has anyone tried to do the reuters example with both approaches?  I seem to 
have problems getting them to run.

Cheers
James.


On 2011-01-29, at 9:25 AM, Ted Yu wrote:

> $MAHOUT_HOME/examples/bin/build-reuters.shFYI
> 
> On Sat, Jan 29, 2011 at 12:57 AM, Marco Didonna wrote:
> 
>> On 01/29/2011 05:17 AM, Lance Norskog wrote:
>>> Look at the Reuters example in the Mahout project:
>> http://mahout.apache.org
>> 
>> Ehm could you point me to it ? I cannot find it
>> 
>> Thanks
>> 
>> 
>> 



Re: Reduce progress goes backward?

2011-02-01 Thread James Seigel
It means that the scheduler is killing off some of your reducer tasks or some of 
them are dying.  Maybe they are taking too long.  You should check out your 
job tracker and look at some of the details and then drill down to see if you 
are getting any errors in some of your reducers.

Cheers
James
On 2011-02-01, at 12:17 PM, Shi Yu wrote:

> Hi,
> 
> I observe that sometimes the map/reduce progress is going backward. What does 
> this mean?
> 
> 
> 11/02/01 12:57:51 INFO mapred.JobClient:  map 100% reduce 99%
> 11/02/01 12:59:14 INFO mapred.JobClient:  map 100% reduce 98%
> 11/02/01 12:59:45 INFO mapred.JobClient:  map 100% reduce 99%
> 11/02/01 13:03:24 INFO mapred.JobClient:  map 100% reduce 98%
> 11/02/01 13:09:16 INFO mapred.JobClient:  map 100% reduce 97%
> 11/02/01 13:09:19 INFO mapred.JobClient:  map 100% reduce 96%
> 11/02/01 13:11:14 INFO mapred.JobClient:  map 100% reduce 97%
> 11/02/01 13:12:33 INFO mapred.JobClient:  map 100% reduce 96%
> 11/02/01 13:13:05 INFO mapred.JobClient:  map 100% reduce 95%
> 
> 
> Shi
> 



Re: IdentityReducer is called instead of my own

2011-02-02 Thread James Seigel
Share code from your mapper?

Check to see if there are any errors on the job tracker reports that might 
indicate the inability to find the class.

James.

On 2011-02-02, at 2:23 PM, Christian Kunz wrote:

> Without seeing the source code of the reduce method of  the InvIndexReduce 
> class
> my best guess would be that the signature is incorrect. I saw this happen 
> when migrating from old to new api:
> 
> protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
>
> (Iterable<VALUEIN>, not Iterator<VALUEIN> as in the old API)
> 
> -Christian
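
A sketch of the signature Christian describes, with @Override added so the
compiler flags a mismatch instead of the job silently falling back to the
identity reduce (the concrete type parameters here are only an example):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override   // compile error here if the parameter types don't match
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text v : values) {
            if (postings.length() > 0) postings.append(',');
            postings.append(v.toString());
        }
        context.write(key, new Text(postings.toString()));
    }
}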
> 
> 
> On Feb 2, 2011, at 12:22 PM, Marco Didonna wrote:
> 
>> Hello everybody,
>> I am experiencing a weird issue: I have written a small hadoop program
>> and I am launching it using this https://gist.github.com/808297
>> JobDriver. Strangely InvIndexReducer is never invoked and the default
>> reducer kicks in. I really cannot understand where the problem could be:
>> as you can see I am using the new version of the api, (hadoop >= 0.20).
>> 
>> Any help is appreciated
>> 
>> MD
>> 
> 



Re: recommendation on HDDs

2011-02-12 Thread James Seigel
The only thing of concern is that HDFS doesn't seem to do
exceptionally well with different-sized disks in practice.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-12, at 8:43 AM, Edward Capriolo  wrote:

> On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning  wrote:
>> Bandwidth is definitely better with more active spindles.  I would recommend
>> several larger disks.  The cost is very nearly the same.
>>
>> On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi wrote:
>>
>>> Thanks for your inputs, Michael.  We have 6 open SATA ports on the
>>> motherboards. That is the reason why we are thinking of 4 to 5 data disks
>>> and 1 OS disk.
>>> Are you suggesting use of one 2TB disk instead of four 500GB disks lets
>>> say?
>>> I thought that the HDFS utilization/throughput increases with the # of
>>> disks
>>> per node (assuming that the total usable IO bandwidth increases
>>> proportionally).
>>>
>>> -Shrinivas
>>>
>>> On Thu, Feb 10, 2011 at 4:25 PM, Michael Segel >>> wrote:
>>>

 Shrinivas,

 Assuming you're in the US, I'd recommend the following:

 Go with 2TB 7200 SATA hard drives.
 (Not sure what type of hardware you have)

 What  we've found is that in the data nodes, there's an optimal
 configuration that balances price versus performance.

 While your chasis may hold 8 drives, how many open SATA ports are on the
 motherboard? Since you're using JBOD, you don't want the additional
>>> expense
 of having to purchase a separate controller card for the additional
>>> drives.

 I'm running Seagate drives at home and I haven't had any problems for
 years.
 When you look at your drive, you need to know total storage, speed
>>> (rpms),
 and cache size.
 Looking at Microcenter's pricing... a 2TB 3.0Gb/s SATA Hitachi was $110.00, a
 1TB Seagate was $70.00, and
 a 250GB SATA drive was $45.00.

 So 2TB = 110, 140, 180 (respectively)

 So you get a better deal on 2TB.

 So if you go out and get more drives but of lower density, you'll end up
 spending more money and use more energy, but I doubt you'll see a real
 performance difference.

 The other thing is that if you want to add more disk, you have room to
 grow. (Just add more disk and restart the node, right?)
 If all of your disk slots are filled, you're SOL. You have to take out
>>> the
 box, replace all of the drives, then add to cluster as 'new' node.

 Just my $0.02 cents.

 HTH

 -Mike

> Date: Thu, 10 Feb 2011 15:47:16 -0600
> Subject: Re: recommendation on HDDs
> From: jshrini...@gmail.com
> To: common-user@hadoop.apache.org
>
> Hi Ted, Chris,
>
> Much appreciate your quick reply. The reason why we are looking for
 smaller
> capacity drives is because we are not anticipating a huge growth in
>>> data
> footprint and also read somewhere that larger the capacity of the
>>> drive,
> bigger the number of platters in them and that could affect drive
> performance. But looks like you can get 1TB drives with only 2
>>> platters.
> Large capacity drives should be OK for us as long as they perform
>>> equally
> well.
>
> Also, the systems that we have can host up to 8 SATA drives in them. In
 that
> case, would  backplanes offer additional advantages?
>
> Any suggestions on 5400 vs. 7200 vs. 10K RPM disks?  I guess 10K rpm
 disks
> would be overkill comparing their perf/cost advantage?
>
> Thanks for your inputs.
>
> -Shrinivas
>
> On Thu, Feb 10, 2011 at 2:48 PM, Chris Collins <
 chris_j_coll...@yahoo.com>wrote:
>
>> Of late we have had serious issues with seagate drives in our hadoop
>> cluster.  These were purchased over several purchasing cycles and
 pretty
>> sure it wasn't just a single "bad batch".   Because of this we
>>> switched
 to
>> buying 2TB Hitachi drives which seem to have been considerably more
 reliable.
>>
>> Best
>>
>> C
>> On Feb 10, 2011, at 12:43 PM, Ted Dunning wrote:
>>
>>> Get bigger disks.  Data only grows and having extra is always good.
>>>
>>> You can get 2TB drives for <$100 and 1TB for < $75.
>>>
>>> As far as transfer rates are concerned, any 3GB/s SATA drive is
>>> going
 to
>> be
>>> about the same (ish).  Seek times will vary a bit with rotation
 speed,
>> but
>>> with Hadoop, you will be doing long reads and writes.
>>>
>>> Your controller and backplane will have a MUCH bigger vote in
>>> getting
>>> acceptable performance.  With only 4 or 5 drives, you don't have to
 worry
>>> about super-duper backplane, but you can still kill performance
>>> with
 a
>> lousy
>>> controller.
>>>
>>> On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi <
 jshrini...@gmail.com
>>> wrote:
>>>
 What wo

Re: Reduce java.lang.OutOfMemoryError

2011-02-16 Thread James Seigel
Well the first thing I'd ask to see (if we can) is the code or a
description of what your reducer is doing.

If it is holding on to objects too long or accumulating lists well
then with the right amount of data you will run OOM.

Another thought is that you've just not allocated enough mem for the
reducer to run properly anyway. Try passing in a setting for the
reducer that ups the memory for it. 768 perhaps.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-16, at 8:12 AM, Kelly Burkhart  wrote:

> I have had it fail with a single reducer and with 100 reducers.
> Ultimately it needs to be funneled to a single reducer though.
>
> -K
>
> On Wed, Feb 16, 2011 at 9:02 AM, real great..
>  wrote:
>> Hi,
>> How many reducers are you using currently?
>> Try increasing the number or reducers.
>> Let me know if it helps.
>>
>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart 
>> wrote:
>>
>>> Hello, I'm seeing frequent fails in reduce jobs with errors similar to
>>> this:
>>>
>>>
>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
>>> header: attempt_201102081823_0175_m_002153_0, compressed len: 172492,
>>> decompressed len: 172488
>>> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner:
>>> attempt_201102081823_0175_r_34_0 : Map output copy failure :
>>> java.lang.OutOfMemoryError: Java heap space
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>
>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
>>> Shuffling 172488 bytes (172492 raw bytes) into RAM from
>>> attempt_201102081823_0175_m_002153_0
>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
>>> header: attempt_201102081823_0175_m_002118_0, compressed len: 161944,
>>> decompressed len: 161940
>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
>>> header: attempt_201102081823_0175_m_001704_0, compressed len: 228365,
>>> decompressed len: 228361
>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: Task
>>> attempt_201102081823_0175_r_34_0: Failed fetch #1 from
>>> attempt_201102081823_0175_m_002153_0
>>> 2011-02-15 15:21:10,424 FATAL org.apache.hadoop.mapred.TaskRunner:
>>> attempt_201102081823_0175_r_34_0 : Map output copy failure :
>>> java.lang.OutOfMemoryError: Java heap space
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>
>>> Some also show this:
>>>
>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>at sun.net.www.http.ChunkedInputStream.(ChunkedInputStream.java:63)
>>>at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:811)
>>>at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>>>at
>>> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>at
>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>
>>> The particular job I'm running is an attempt to merge multiple time
>>> series files into a single file.  The job tracker shows the following:
>>>
>>>
>>> Kind     Num Tasks   Complete   Killed   Failed/Killed Task Attempts
>>> map      15795       15795      0        0 / 29
>>> reduce   100         30         70       17 / 29
>>>
>>> All of the files I'm reading have records with a timestamp key similar to:
>>>
>>> 2011-01-03 08:30:00.457000
>>>
>>> My map job is a simple python program that ignores rows with times <
>>> 08:30:00 and > 15:00:00, determines the type of input row and writes
>>> it to stdout with very minor modification.  It maintains no state and
>>> should not use any significant memory.  My reducer is the
>>> IdentityReducer.  The input files are individually gzipped then put
>>> into hdfs.  The total uncompressed size of the output should be around
>>> 150G.  Our cluster is 32 no

Re: Reduce java.lang.OutOfMemoryError

2011-02-16 Thread James Seigel
...oh sorry I didn't scroll below the exception the first time. Try part 2

James

Sent from my mobile. Please excuse the typos.

On 2011-02-16, at 8:00 AM, Kelly Burkhart  wrote:

> Hello, I'm seeing frequent fails in reduce jobs with errors similar to this:
>
>
> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
> header: attempt_201102081823_0175_m_002153_0, compressed len: 172492,
> decompressed len: 172488
> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner:
> attempt_201102081823_0175_r_34_0 : Map output copy failure :
> java.lang.OutOfMemoryError: Java heap space
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
> Shuffling 172488 bytes (172492 raw bytes) into RAM from
> attempt_201102081823_0175_m_002153_0
> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
> header: attempt_201102081823_0175_m_002118_0, compressed len: 161944,
> decompressed len: 161940
> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
> header: attempt_201102081823_0175_m_001704_0, compressed len: 228365,
> decompressed len: 228361
> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: Task
> attempt_201102081823_0175_r_34_0: Failed fetch #1 from
> attempt_201102081823_0175_m_002153_0
> 2011-02-15 15:21:10,424 FATAL org.apache.hadoop.mapred.TaskRunner:
> attempt_201102081823_0175_r_34_0 : Map output copy failure :
> java.lang.OutOfMemoryError: Java heap space
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
> Some also show this:
>
> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>at sun.net.www.http.ChunkedInputStream.<init>(ChunkedInputStream.java:63)
>at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:811)
>at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
> The particular job I'm running is an attempt to merge multiple time
> series files into a single file.  The job tracker shows the following:
>
>
> Kind     Num Tasks   Complete   Killed   Failed/Killed Task Attempts
> map      15795       15795      0        0 / 29
> reduce   100         30         70       17 / 29
>
> All of the files I'm reading have records with a timestamp key similar to:
>
> 2011-01-03 08:30:00.457000
>
> My map job is a simple python program that ignores rows with times <
> 08:30:00 and > 15:00:00, determines the type of input row and writes
> it to stdout with very minor modification.  It maintains no state and
> should not use any significant memory.  My reducer is the
> IdentityReducer.  The input files are individually gzipped then put
> into hdfs.  The total uncompressed size of the output should be around
> 150G.  Our cluster is 32 nodes each of which has 16G RAM and most of
> which have two 2T drives.  We're running hadoop 0.20.2.
>
>
> Can anyone provide some insight on how we can eliminate this issue?
> I'm certain this email does not provide enough info, please let me
> know what further information is needed to troubleshoot.
>
> Thanks in advance,
>
> -Kelly


Re: Reduce java.lang.OutOfMemoryError

2011-02-16 Thread James Seigel
He might not have that conf distributed out to each machine


Sent from my mobile. Please excuse the typos.

On 2011-02-16, at 9:10 AM, Kelly Burkhart  wrote:

> Our clust admin (who's out of town today) has mapred.child.java.opts
> set to -Xmx1280 in mapred-site.xml.  However, if I go to the job
> configuration page for a job I'm running right now, it claims this
> option is set to -Xmx200m.  There are other settings in
> mapred-site.xml that are different too.  Why would map/reduce jobs not
> respect the mapred-site.xml file?
>
> -K
>
> On Wed, Feb 16, 2011 at 9:43 AM, Jim Falgout  
> wrote:
>> You can set the amount of memory used by the reducer using the 
>> mapreduce.reduce.java.opts property. Set it in mapred-site.xml or override 
>> it in your job. You can set it to something like -Xmx512M to increase the 
>> amount of memory used by the JVM spawned for the reducer task.
>>
>> -Original Message-
>> From: Kelly Burkhart [mailto:kelly.burkh...@gmail.com]
>> Sent: Wednesday, February 16, 2011 9:12 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Reduce java.lang.OutOfMemoryError
>>
>> I have had it fail with a single reducer and with 100 reducers.
>> Ultimately it needs to be funneled to a single reducer though.
>>
>> -K
>>
>> On Wed, Feb 16, 2011 at 9:02 AM, real great..
>>  wrote:
>>> Hi,
>>> How many reducers are you using currently?
>>> Try increasing the number or reducers.
>>> Let me know if it helps.
>>>
>>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart 
>>> wrote:
>>>
 Hello, I'm seeing frequent fails in reduce jobs with errors similar
 to
 this:


 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
 header: attempt_201102081823_0175_m_002153_0, compressed len: 172492,
 decompressed len: 172488
 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner:
 attempt_201102081823_0175_r_34_0 : Map output copy failure :
 java.lang.OutOfMemoryError: Java heap space
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuf
 fleInMemory(ReduceTask.java:1508)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getM
 apOutput(ReduceTask.java:1408)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copy
 Output(ReduceTask.java:1261)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(
 ReduceTask.java:1195)

 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
 Shuffling 172488 bytes (172492 raw bytes) into RAM from
 attempt_201102081823_0175_m_002153_0
 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
 header: attempt_201102081823_0175_m_002118_0, compressed len: 161944,
 decompressed len: 161940
 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
 header: attempt_201102081823_0175_m_001704_0, compressed len: 228365,
 decompressed len: 228361
 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
 Task
 attempt_201102081823_0175_r_34_0: Failed fetch #1 from
 attempt_201102081823_0175_m_002153_0
 2011-02-15 15:21:10,424 FATAL org.apache.hadoop.mapred.TaskRunner:
 attempt_201102081823_0175_r_34_0 : Map output copy failure :
 java.lang.OutOfMemoryError: Java heap space
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuf
 fleInMemory(ReduceTask.java:1508)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getM
 apOutput(ReduceTask.java:1408)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copy
 Output(ReduceTask.java:1261)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(
 ReduceTask.java:1195)

 Some also show this:

 Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
at
 sun.net.www.http.ChunkedInputStream.(ChunkedInputStream.java:63)
at
 sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:811)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLCon
 nection.java:1072)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getI
 nputStream(ReduceTask.java:1447)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getM
 apOutput(ReduceTask.java:1349)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copy
 Output(ReduceTask.java:1261)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(
 ReduceTask.java:1195)

 The particular job I'm running is an attempt to merge multiple time
 series files into a single file.  The job tracker shows the followin

Re: Reduce java.lang.OutOfMemoryError

2011-02-16 Thread James Seigel
Hrmmm. Well as you've pointed out. 200m is quite small and is probably
the cause.

Now there might be some overriding settings in something you are using
to launch or something.

You could mark those values as final in the main conf so they can't be
overridden, then see what tries to override them in the logs.

Cheers
James

Sent from my mobile. Please excuse the typos.
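
A hedged illustration of both halves of that: mark the heap setting final on the
cluster so a stray client-side config cannot silently shrink it, and/or pass it
explicitly when submitting (the -D form assumes the driver uses
ToolRunner/GenericOptionsParser; jar and class names are placeholders):

<!-- mapred-site.xml on the cluster -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1280m</value>
  <final>true</final>   <!-- client configs can no longer override this -->
</property>

# per-job override from the submitting machine
hadoop jar myjob.jar com.example.MyDriver \
    -D mapred.child.java.opts=-Xmx1024m input output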

On 2011-02-16, at 9:21 AM, Kelly Burkhart  wrote:

> I should have mentioned this in my last email: I thought of that so I
> logged into every machine in the cluster; each machine's
> mapred-site.xml has the same md5sum.
>
> On Wed, Feb 16, 2011 at 10:15 AM, James Seigel  wrote:
>> He might not have that conf distributed out to each machine
>>
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2011-02-16, at 9:10 AM, Kelly Burkhart  wrote:
>>
>>> Our clust admin (who's out of town today) has mapred.child.java.opts
>>> set to -Xmx1280 in mapred-site.xml.  However, if I go to the job
>>> configuration page for a job I'm running right now, it claims this
>>> option is set to -Xmx200m.  There are other settings in
>>> mapred-site.xml that are different too.  Why would map/reduce jobs not
>>> respect the mapred-site.xml file?
>>>
>>> -K
>>>
>>> On Wed, Feb 16, 2011 at 9:43 AM, Jim Falgout  
>>> wrote:
>>>> You can set the amount of memory used by the reducer using the 
>>>> mapreduce.reduce.java.opts property. Set it in mapred-site.xml or override 
>>>> it in your job. You can set it to something like -Xmx512M to increase the 
>>>> amount of memory used by the JVM spawned for the reducer task.
>>>>
>>>> -Original Message-
>>>> From: Kelly Burkhart [mailto:kelly.burkh...@gmail.com]
>>>> Sent: Wednesday, February 16, 2011 9:12 AM
>>>> To: common-user@hadoop.apache.org
>>>> Subject: Re: Reduce java.lang.OutOfMemoryError
>>>>
>>>> I have had it fail with a single reducer and with 100 reducers.
>>>> Ultimately it needs to be funneled to a single reducer though.
>>>>
>>>> -K
>>>>
>>>> On Wed, Feb 16, 2011 at 9:02 AM, real great..
>>>>  wrote:
>>>>> Hi,
>>>>> How many reducers are you using currently?
>>>>> Try increasing the number or reducers.
>>>>> Let me know if it helps.
>>>>>
>>>>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart 
>>>>> wrote:
>>>>>
>>>>>> Hello, I'm seeing frequent fails in reduce jobs with errors similar
>>>>>> to
>>>>>> this:
>>>>>>
>>>>>>
>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> header: attempt_201102081823_0175_m_002153_0, compressed len: 172492,
>>>>>> decompressed len: 172488
>>>>>> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner:
>>>>>> attempt_201102081823_0175_r_34_0 : Map output copy failure :
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>at
>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuf
>>>>>> fleInMemory(ReduceTask.java:1508)
>>>>>>at
>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getM
>>>>>> apOutput(ReduceTask.java:1408)
>>>>>>at
>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copy
>>>>>> Output(ReduceTask.java:1261)
>>>>>>at
>>>>>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(
>>>>>> ReduceTask.java:1195)
>>>>>>
>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> Shuffling 172488 bytes (172492 raw bytes) into RAM from
>>>>>> attempt_201102081823_0175_m_002153_0
>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> header: attempt_201102081823_0175_m_002118_0, compressed len: 161944,
>>>>>> decompressed len: 161940
>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>> header: attempt_201102081823_0175_m_001704_0, compressed len: 228365,
>>>>>> decompressed len: 228361
>>>>&

Re: Reduce java.lang.OutOfMemoryError

2011-02-16 Thread James Seigel
Good luck.

Let me know how it goes.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-16, at 11:11 AM, Kelly Burkhart  wrote:

> OK, the job was preferring the config file on my local machine which
> is not part of the cluster over the cluster config files.  That seems
> completely broken to me; my config was basically empty other than
> containing the location of the cluster and my job apparently used
> defaults rather than the cluster config.  It doesn't make sense to me
> to keep configuration files synchronized on every machine that may
> access the cluster.
>
> I'm running again; we'll see if it completes this time.
>
> -K
>
> On Wed, Feb 16, 2011 at 10:30 AM, James Seigel  wrote:
>> Hrmmm. Well as you've pointed out. 200m is quite small and is probably
>> the cause.
>>
>> Now thEre might be some overriding settings in something you are using
>> to launch or something.
>>
>> You could set those values in the config to not be overridden in the
>> main conf then see what tries to override it in the logs
>>
>> Cheers
>> James
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2011-02-16, at 9:21 AM, Kelly Burkhart  wrote:
>>
>>> I should have mentioned this in my last email: I thought of that so I
>>> logged into every machine in the cluster; each machine's
>>> mapred-site.xml has the same md5sum.
>>>
>>> On Wed, Feb 16, 2011 at 10:15 AM, James Seigel  wrote:
>>>> He might not have that conf distributed out to each machine
>>>>
>>>>
>>>> Sent from my mobile. Please excuse the typos.
>>>>
>>>> On 2011-02-16, at 9:10 AM, Kelly Burkhart  wrote:
>>>>
>>>>> Our clust admin (who's out of town today) has mapred.child.java.opts
>>>>> set to -Xmx1280 in mapred-site.xml.  However, if I go to the job
>>>>> configuration page for a job I'm running right now, it claims this
>>>>> option is set to -Xmx200m.  There are other settings in
>>>>> mapred-site.xml that are different too.  Why would map/reduce jobs not
>>>>> respect the mapred-site.xml file?
>>>>>
>>>>> -K
>>>>>
>>>>> On Wed, Feb 16, 2011 at 9:43 AM, Jim Falgout  
>>>>> wrote:
>>>>>> You can set the amount of memory used by the reducer using the 
>>>>>> mapreduce.reduce.java.opts property. Set it in mapred-site.xml or 
>>>>>> override it in your job. You can set it to something like: -Xmx512M to 
>>>>>> increase the amount of memory used by the JVM spawned for the reducer 
>>>>>> task.
>>>>>>
>>>>>> -Original Message-
>>>>>> From: Kelly Burkhart [mailto:kelly.burkh...@gmail.com]
>>>>>> Sent: Wednesday, February 16, 2011 9:12 AM
>>>>>> To: common-user@hadoop.apache.org
>>>>>> Subject: Re: Reduce java.lang.OutOfMemoryError
>>>>>>
>>>>>> I have had it fail with a single reducer and with 100 reducers.
>>>>>> Ultimately it needs to be funneled to a single reducer though.
>>>>>>
>>>>>> -K
>>>>>>
>>>>>> On Wed, Feb 16, 2011 at 9:02 AM, real great..
>>>>>>  wrote:
>>>>>>> Hi,
>>>>>>> How many reducers are you using currently?
>>>>>>> Try increasing the number or reducers.
>>>>>>> Let me know if it helps.
>>>>>>>
>>>>>>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello, I'm seeing frequent fails in reduce jobs with errors similar
>>>>>>>> to
>>>>>>>> this:
>>>>>>>>
>>>>>>>>
>>>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>> header: attempt_201102081823_0175_m_002153_0, compressed len: 172492,
>>>>>>>> decompressed len: 172488
>>>>>>>> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner:
>>>>>>>> attempt_201102081823_0175_r_34_0 : Map output copy failure :
>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>at
>>>>>>>> org.apache.hadoop.mapred.ReduceTask$R

Re: Trouble in installing Hbase

2011-02-24 Thread James Seigel
You probably should ask on the cloudera support forums as cloudera has
for some reason changed the users that things run under.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-24, at 8:00 AM, JAGANADH G  wrote:

> Hi All
>
> I was trying to install CDH3 Hhase in Fedora14 .
> It gives the following error. Any solution to resolve this
> Transaction Test Succeeded
> Running Transaction
> Error in PREIN scriptlet in rpm package hadoop-hbase-0.90.1+8-1.noarch
> /usr/bin/install: invalid user `hbase'
> /usr/bin/install: invalid user `hbase'
> error: %pre(hadoop-hbase-0.90.1+8-1.noarch) scriptlet failed, exit status 1
> error:   install: %pre scriptlet failed (2), skipping
> hadoop-hbase-0.90.1+8-1
>
> Failed:
>  hadoop-hbase.noarch
> 0:0.90.1+8-1
>
>
> Complete!
> [root@linguist hexp]#
>
> --
> **
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
> *ILUGCBE*
> http://ilugcbe.techstud.org


Re: Check lzo is working on intermediate data

2011-02-24 Thread James Seigel
Run a standard job before. Look at the summary data.

Run the job again after the changes and look at the summary.

You should see less file system bytes written from the map stage.
Sorry, might be most obvious in shuffle bytes.

I don't have a terminal in front of me right now.
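With 0.20.2 my recollection (worth double-checking) is that the framework reads the old-style names, i.e. your first block:

  mapred.compress.map.output = true
  mapred.map.output.compression.codec = com.hadoop.compression.lzo.LzoCodec

Then run the same job with and without those set and compare the job summary counters; FILE_BYTES_WRITTEN on the map side and "Reduce shuffle bytes" should both drop noticeably if the intermediate data really is being LZO compressed.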

James

Sent from my mobile. Please excuse the typos.

On 2011-02-24, at 8:22 AM, Marc Sturlese  wrote:

>
> Hey there,
> I am using hadoop 0.20.2. I 've successfully installed LZOCompression
> following these steps:
> https://github.com/kevinweil/hadoop-lzo
>
> I have some MR jobs written with the new API and I want to compress
> intermediate data.
> Not sure if my mapred-site.xml should have the properties:
>
>  
>mapred.compress.map.output
>true
>  
>  
>mapred.map.output.compression.codec
>com.hadoop.compression.lzo.LzoCodec
>  
>
> or:
>
>  
>mapreduce.map.output.compress
>true
>  
>  
>mapreduce.map.output.compress.codec
>com.hadoop.compression.lzo.LzoCodec
>  
>
> How can I check that the compression is been applied?
>
> Thanks in advance
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Check-lzo-is-working-on-intermediate-data-tp2567704p2567704.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Packaging for Hadoop - what about the Hadoop libraries?

2011-02-25 Thread James Seigel
Rely on the ones that are already present in the Hadoop installation.

It is a little tricky for the other ones; well, not really, once you “get it”.

-libjars on the command line will ship the “supporting” jars out with the job to the mappers and reducers. However, if for some reason you need them during job submission, they won’t be present there; you either need to have those on the command-line classpath or bundled.
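A sketch of the usual invocation (jar and class names here are made up), assuming your driver goes through ToolRunner/GenericOptionsParser so the generic options get parsed:

  hadoop jar my-job.jar com.example.MyDriver -libjars extra-lib-1.jar,extra-lib-2.jar /input /output

The jars listed after -libjars get shipped to the tasks via the distributed cache; the driver JVM itself still needs them on its own classpath (HADOOP_CLASSPATH or bundled into the job jar) if it touches those classes before submission.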

Cheers
James.


On 2011-02-25, at 6:06 AM, Mark Kerzner wrote:

> Hi,
> 
> when packaging additional libraries for an MR job, I can use a script or a
> Maven Hadoop plugin, but what about the Hadoop libraries themselves? Should
> I package them in, or should I rely on those jars that are already present
> in the Hadoop installation where the code will be running? What is the best
> practice?
> 
> Thank you,
> Mark



Re: Catching mapred exceptions on the client

2011-02-25 Thread James Seigel
Hello,

It is hard to give advice without the specific code.  However, if you don’t 
have your job submission set up to wait for completion then it might be 
launching all your jobs at the same time.

Check to see how your jobs are being submitted.

Sorry, I can’t be more helpful.
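As a starting point, though, here is roughly what blocking submission looks like with the new API (a minimal sketch; class and job names are made up). waitForCompletion blocks and returns false on failure, so the driver can stop the chain:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  // inside the driver
  Configuration conf = new Configuration();

  Job job1 = new Job(conf, "MapredJob1");
  // ... set mapper/reducer/input/output for job1 ...
  if (!job1.waitForCompletion(true)) {
    throw new IOException("MapredJob1 failed, not starting MapredJob2");
  }

  Job job2 = new Job(conf, "MapredJob2");
  // ... set up job2 ...
  if (!job2.waitForCompletion(true)) {
    throw new IOException("MapredJob2 failed");
  }

Exceptions like the ClassNotFoundException you mention are also declared by waitForCompletion, so they will propagate to your catch block instead of being swallowed.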

James


On 2011-02-25, at 9:00 AM,  wrote:

> Hello all,
> I have few mapreduce jobs that I am calling from a java driver. The problem I 
> am facing is that when there is an exception in mapred job, the exception is 
> not propogated to the client so even if first job failed, its going to second 
> job and so on. Is there an another way of catching exceptions from mapred 
> jobs on the client side?
> 
> I am using hadoop-0.20.2.
> 
> My Example is:
> Driver {
> try {
>Call MapredJob1;
>Call MapredJob2;
>..
>..
>}catch(Exception e) {
>throw new exception;
>}
> }
> 
> When MapredJob1 throws ClassNotFoundException, MapredJob2 and others are 
> still executing.
> 
> Any insight into it is appreciated.
> 
> Praveen
> 



Re: TaskTracker not starting on all nodes

2011-02-26 Thread James Seigel
Maybe your ssh keys aren’t distributed the same on each machine or the machines 
aren’t configured the same?

J


On 2011-02-26, at 8:25 AM, bikash sharma wrote:

> Hi,
> I have a 10 nodes Hadoop cluster, where I am running some benchmarks for
> experiments.
> Surprisingly, when I initialize the Hadoop cluster
> (hadoop/bin/start-mapred.sh), in many instances, only some nodes have
> TaskTracker process up (seen using jps), while other nodes do not have
> TaskTrackers. Could anyone please explain?
> 
> Thanks,
> Bikash



Re: TaskTracker not starting on all nodes

2011-02-27 Thread James Seigel
 Did you get it working?  What was the fix?

Sent from my mobile. Please excuse the typos.

On 2011-02-27, at 8:43 PM, Simon  wrote:

> Hey Bikash,
>
> Maybe you can manually start a  tasktracker on the node and see if there are
> any error messages. Also, don't forget to check your configure files for
> mapreduce and hdfs and make sure datanode can start successfully first.
> After all these steps, you can submit a job on the master node and see if
> there are any communication between these failed nodes and the master node.
> Post your error messages here if possible.
>
> HTH.
> Simon -
>
> On Sat, Feb 26, 2011 at 10:44 AM, bikash sharma 
> wrote:
>
>> Thanks James. Well all the config. files and shared keys are on a shared
>> storage that is accessed by all the nodes in the cluster.
>> At times, everything runs fine on initialization, but at other times, the
>> same problem persists, so was bit confused.
>> Also, checked the TaskTracker logs on those nodes, there does not seem to
>> be
>> any error.
>>
>> -bikash
>>
>> On Sat, Feb 26, 2011 at 10:30 AM, James Seigel  wrote:
>>
>>> Maybe your ssh keys aren’t distributed the same on each machine or the
>>> machines aren’t configured the same?
>>>
>>> J
>>>
>>>
>>> On 2011-02-26, at 8:25 AM, bikash sharma wrote:
>>>
>>>> Hi,
>>>> I have a 10 nodes Hadoop cluster, where I am running some benchmarks
>> for
>>>> experiments.
>>>> Surprisingly, when I initialize the Hadoop cluster
>>>> (hadoop/bin/start-mapred.sh), in many instances, only some nodes have
>>>> TaskTracker process up (seen using jps), while other nodes do not have
>>>> TaskTrackers. Could anyone please explain?
>>>>
>>>> Thanks,
>>>> Bikash
>>>
>>>
>>
>
>
>
> --
> Regards,
> Simon


Re: why quick sort when spill map output?

2011-02-28 Thread James Seigel
Sorting out of the map phase is core to how hadoop works.  Are you asking why 
sort at all?  or why did someone use quick sort as opposed to _sort?

Cheers
James


On 2011-02-28, at 3:30 AM, elton sky wrote:

> Hello forumers,
> 
> Before spill the data in kvbuffer to local disk in map task, k/v are
> sorted using quick sort. The complexity of quick sort is O(nlogn) and
> worst case is O(n^2).
> Why using quick sort?
> 
> Regards



Re: Comparison between Gzip and LZO

2011-03-02 Thread James Seigel
Slightly off point for this conversation, but I thought it worth mentioning: LZO is splittable, which makes it a good fit for hadoopy things. Just something to remember when you do get some final results on performance.
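One caveat from memory (worth verifying against the hadoop-lzo README): plain .lzo files only split once you build an index for them, e.g. something along the lines of

  hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/big-file.lzo

where the jar path and file name are placeholders.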

Cheers
James.


On 2011-03-02, at 8:12 PM, Brian Bockelman wrote:

> 
> I think some profiling is in order: claiming LZO decompresses at 1.0MB/s and 
> is more than 3x faster at compression than decompression (especially when 
> it's a well known asymmetric algorithm in favor of decompression speed) is 
> somewhat unbelievable.
> 
> I see that you use small files.  Maybe whatever you do for LZO and 
> Gzip/Hadoop has a large startup overhead?
> 
> Again, sounds like you'll be spending an hour or so with a profiler.
> 
> Brian
> 
> On Mar 2, 2011, at 2:16 PM, Niels Basjes wrote:
> 
>> Question: Are you 100% sure that nothing else was running on that
>> system during the tests?
>> No cron jobs, no "makewhatis" or "updatedb"?
>> 
>> P.S. There is a permission issue with downloading one of the files.
>> 
>> 2011/3/2 José Vinícius Pimenta Coletto :
>>> Hi,
>>> 
>>> I'm making a comparison between the following compression methods: gzip
>>> and lzo provided by Hadoop and gzip from package java.util.zip.
>>> The test consists of compression and decompression of approximately 92,000
>>> files with an average size of 2kb, however the decompression time of lzo is
>>> twice the decompression time of gzip provided by Hadoop, it does not seem
>>> right.
>>> The results obtained in the test are:
>>> 
>>> Method |   Bytes   |   Compression
>>>  |Decompression
>>>-   | - | Total Time(with i/o)  Time Speed
>>>   | Total Time(with i/o)  Time  Speed
>>> Gzip (Haddop)| 200876304 | 121.454s  43.167s
>>> 4,653,424.079 B/s | 332.305s  111.806s   1,796,635.326 B/s
>>> Lzo  | 200876304 | 120.564s  54.072s
>>> 3,714,914.621 B/s | 509.371s  184.906s   1,086,368.904 B/s
>>> Gzip (java.util.zip) | 200876304 | 148.014s  63.414s
>>> 3,167,647.371 B/s | 483.148s  4.528s44,360,682.244 B/s
>>> 
>>> You can see the code I'm using to the test here:
>>> http://www.linux.ime.usp.br/~jvcoletto/compression/
>>> 
>>> Can anyone explain me why am I getting these results?
>>> Thanks.
>>> 
>> 
>> 
>> 
>> -- 
>> Met vriendelijke groeten,
>> 
>> Niels Basjes
> 



Re: k-means

2011-03-04 Thread James Seigel
Mahout project?

Sent from my mobile. Please excuse the typos.

On 2011-03-04, at 6:41 AM, MANISH SINGLA  wrote:

> Hey ppl...
> I need some serious help...I m not able to run kmeans code in
> hadoop...does anyone have a running code...that they would have
> tried...
>
> Regards
> MANISH


Re: k-means

2011-03-04 Thread James Seigel
I am not near a computer so I won't be able to give you specifics.  So
instead, I'd suggest Manning's "mahout in action" book which is in
their early access form for some basic direction.

Disclosure: I have no relation to the publisher or authors.

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2011-03-04, at 7:37 AM, MANISH SINGLA  wrote:

> are u suggesting me that???  if yes can u plzzz tell me the steps to
> use that...because I havent used it yet...a quick reply will really be
> appreciated...
> Thanx
> Manish
>
> On Fri, Mar 4, 2011 at 7:39 PM, James Seigel  wrote:
>> Mahout project?
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2011-03-04, at 6:41 AM, MANISH SINGLA  wrote:
>>
>>> Hey ppl...
>>> I need some serious help...I m not able to run kmeans code in
>>> hadoop...does anyone have a running code...that they would have
>>> tried...
>>>
>>> Regards
>>> MANISH
>>


Re: k-means

2011-03-04 Thread James Seigel
Manning site. You can download it and get a paper copy when it comes
out if you'd like.

James

Sent from my mobile. Please excuse the typos.

On 2011-03-04, at 7:53 AM, Mike Nute  wrote:

> James,
>
> Do you know how to get a copy of this book in early access form? Amazon 
> doesn't release it until may.  Thanks!
>
> Mike Nute
> --Original Message--
> From: James Seigel
> To: common-user@hadoop.apache.org
> ReplyTo: common-user@hadoop.apache.org
> Subject: Re: k-means
> Sent: Mar 4, 2011 9:46 AM
>
> I am not near a computer so I won't be able to give you specifics.  So
> instead, I'd suggest Manning's "mahout in action" book which is in
> their early access form for some basic direction.
>
> Disclosure: I have no relation to the publisher or authors.
>
> Cheers
> James
>
> Sent from my mobile. Please excuse the typos.
>
> On 2011-03-04, at 7:37 AM, MANISH SINGLA  wrote:
>
>> are u suggesting me that???  if yes can u plzzz tell me the steps to
>> use that...because I havent used it yet...a quick reply will really be
>> appreciated...
>> Thanx
>> Manish
>>
>> On Fri, Mar 4, 2011 at 7:39 PM, James Seigel  wrote:
>>> Mahout project?
>>>
>>> Sent from my mobile. Please excuse the typos.
>>>
>>> On 2011-03-04, at 6:41 AM, MANISH SINGLA  wrote:
>>>
>>>> Hey ppl...
>>>> I need some serious help...I m not able to run kmeans code in
>>>> hadoop...does anyone have a running code...that they would have
>>>> tried...
>>>>
>>>> Regards
>>>> MANISH
>>>
>


Re: TaskTracker not starting on all nodes

2011-03-04 Thread James Seigel
Sounds like just a bit more work on understanding ssh will get you there.

What you are looking for is getting that public key into authorized_keys
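A minimal sketch of the usual dance, run as the hadoop user on the master (user and host names are made up):

  ssh-keygen -t rsa -P ""        # only if you don't already have a key
  ssh-copy-id hadoop@slave1      # or: cat ~/.ssh/id_rsa.pub | ssh hadoop@slave1 'cat >> ~/.ssh/authorized_keys'
  ssh hadoop@slave1              # should now log in without a password

If sshd still prompts, check that ~/.ssh is mode 700 and ~/.ssh/authorized_keys is 600 on the slave. You can also start a daemon by hand on one node with bin/hadoop-daemon.sh start tasktracker and watch its log under logs/.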

James

Sent from my mobile. Please excuse the typos.

On 2011-03-04, at 2:58 AM, MANISH SINGLA  wrote:

> Hii all,
> I am trying to setup a 2 node cluster...I have configured all the
> files as specified in the tutorial I am refering to...I copied the
> public key to the slave's machine...but when I ssh to the slave from
> the master, it asks for password everytime...kindly help...
>
> On Fri, Mar 4, 2011 at 11:12 AM, icebergs  wrote:
>> You can check the logs whose tasktracker isn't up.
>> The path is "HADOOP_HOME/logs/".
>> The answer may be in it.
>>
>> 2011/3/2 bikash sharma 
>>
>>> Hi Sonal,
>>> Thanks. I guess you are right. ps -ef exposes such processes.
>>>
>>> -bikash
>>>
>>> On Tue, Mar 1, 2011 at 1:29 PM, Sonal Goyal  wrote:
>>>
>>>> Bikash,
>>>>
>>>> I have sometimes found hanging processes which jps does not report, but a
>>>> ps -ef shows them. Maybe you can check this on the errant nodes..
>>>>
>>>> Thanks and Regards,
>>>> Sonal
>>>> <https://github.com/sonalgoyal/hiho>Hadoop ETL and Data Integration<
>>> https://github.com/sonalgoyal/hiho>
>>>> Nube Technologies <http://www.nubetech.co>
>>>>
>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 1, 2011 at 7:37 PM, bikash sharma >>> wrote:
>>>>
>>>>> Hi James,
>>>>> Sorry for the late response. No, the same problem persists. I
>>> reformatted
>>>>> HDFS, stopped mapred and hdfs daemons and restarted them (using
>>>>> start-dfs.sh
>>>>> and start-mapred.sh from master node). But surprisingly out of 4 nodes
>>>>> cluster, two nodes have TaskTracker running while other two do not have
>>>>> TaskTrackers on them (verified using jps). I guess since I have the
>>> Hadoop
>>>>> installed on shared storage, that might be the issue? Btw, how do I
>>> start
>>>>> the services independently on each node?
>>>>>
>>>>> -bikash
>>>>> On Sun, Feb 27, 2011 at 11:05 PM, James Seigel  wrote:
>>>>>
>>>>>>  Did you get it working?  What was the fix?
>>>>>>
>>>>>> Sent from my mobile. Please excuse the typos.
>>>>>>
>>>>>> On 2011-02-27, at 8:43 PM, Simon  wrote:
>>>>>>
>>>>>>> Hey Bikash,
>>>>>>>
>>>>>>> Maybe you can manually start a  tasktracker on the node and see if
>>>>> there
>>>>>> are
>>>>>>> any error messages. Also, don't forget to check your configure files
>>>>> for
>>>>>>> mapreduce and hdfs and make sure datanode can start successfully
>>>>> first.
>>>>>>> After all these steps, you can submit a job on the master node and
>>> see
>>>>> if
>>>>>>> there are any communication between these failed nodes and the
>>> master
>>>>>> node.
>>>>>>> Post your error messages here if possible.
>>>>>>>
>>>>>>> HTH.
>>>>>>> Simon -
>>>>>>>
>>>>>>> On Sat, Feb 26, 2011 at 10:44 AM, bikash sharma <
>>>>> sharmabiks...@gmail.com
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks James. Well all the config. files and shared keys are on a
>>>>> shared
>>>>>>>> storage that is accessed by all the nodes in the cluster.
>>>>>>>> At times, everything runs fine on initialization, but at other
>>> times,
>>>>>> the
>>>>>>>> same problem persists, so was bit confused.
>>>>>>>> Also, checked the TaskTracker logs on those nodes, there does not
>>>>> seem
>>>>>> to
>>>>>>>> be
>>>>>>>> any error.
>>>>>>>>
>>>>>>>> -bikash
>>>>>>>>
>>>>>>>> On Sat, Feb 26, 2011 at 10:30 AM, James Seigel 
>>>>> wrote:
>>>>>>>>
>>>>>>>>> Maybe your ssh keys aren’t distributed the same on each machine or
>>>>> the
>>>>>>>>> machines aren’t configured the same?
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2011-02-26, at 8:25 AM, bikash sharma wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I have a 10 nodes Hadoop cluster, where I am running some
>>>>> benchmarks
>>>>>>>> for
>>>>>>>>>> experiments.
>>>>>>>>>> Surprisingly, when I initialize the Hadoop cluster
>>>>>>>>>> (hadoop/bin/start-mapred.sh), in many instances, only some nodes
>>>>> have
>>>>>>>>>> TaskTracker process up (seen using jps), while other nodes do not
>>>>> have
>>>>>>>>>> TaskTrackers. Could anyone please explain?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Bikash
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Simon
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>


YYC/Calgary/Alberta Hadoop Users?

2011-03-07 Thread James Seigel
Hello,

Just wondering if there are any YYC hadoop users in the crowd and if
there is any interest in a meetup of any sort?

Cheers
James Seigel
Captain Hammer
Tynt


Re: How to count rows of output files ?

2011-03-08 Thread James Seigel
Simplest case: if you need a sum of the lines for A, B, and C, look at the output that is normally generated, which tells you "Reduce
output records". This can be accessed, like the others are telling
you, as a counter, which you could read and explicitly print out, or
with your eyes in the summary of the job when it is done.
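If you do need the per-file numbers, a sketch of the counter approach Harsh describes below (new API; the group and output names are made up):

  // in the reducer, every time a record goes to output "A"
  context.getCounter("RowCounts", "A").increment(1);

  // in the driver, after job.waitForCompletion(true)
  for (org.apache.hadoop.mapreduce.Counter c : job.getCounters().getGroup("RowCounts")) {
    System.out.println(c.getName() + " = " + c.getValue());
  }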

Cheers
James.

On Tue, Mar 8, 2011 at 3:29 AM, Harsh J  wrote:
> I think the previous reply wasn't very accurate. So you need a count
> per-file? One way I can think of doing that, via the job itself, is to
> use Counter to count the "name of the output + the task's ID". But it
> would not be a good solution if there are several hundreds of tasks.
>
> A distributed count can be performed on a single file, however, using
> an identity mapper + null output and then looking at map-input-records
> counter after completion.
>
> On Tue, Mar 8, 2011 at 3:54 PM, Harsh J  wrote:
>> Count them as you sink using the Counters functionality of Hadoop
>> Map/Reduce (If you're using MultipleOutputs, it has a way to enable
>> counters for each name used). You can then aggregate related counters
>> post-job, if needed.
>>
>> On Tue, Mar 8, 2011 at 3:11 PM, Jun Young Kim  wrote:
>>> Hi.
>>>
>>> my hadoop application generated several output files by a single job.
>>> (for example, A, B, C are generated as a result)
>>>
>>> after finishing a job, I want to count each files' row counts.
>>>
>>> is there any way to count each files?
>>>
>>> thanks.
>>>
>>> --
>>> Junyoung Kim (juneng...@gmail.com)
>>>
>>>
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
>
>
>
> --
> Harsh J
> www.harshj.com
>


Re: Setting up hadoop on a cluster

2011-03-10 Thread James Seigel
How many nodes?

Sent from my mobile. Please excuse the typos.

On 2011-03-10, at 7:05 AM, "Lai  Will"  wrote:

> Hello,
>
> Currently I've been playing around with my single node cluster.
> I'm planning to test my code on a real cluster in the next few weeks.
>
> I've read some manuals on how to deploy it. It seems that still a lot has to 
> be done manually.
> As the cluster I will be working on will probably format their nodes quite 
> frequently, I'm looking for the simplest, mostly or even fully automatic 
> installation procedure that allows me to quickly reinstall hadoop on the 
> cluster.
>
> What are your approaches?
>
> Best,
> Will


Re: Setting up hadoop on a cluster

2011-03-10 Thread James Seigel
Sorry, and where are you hosting the cluster?  Cloud? Physical? Garage?

Sent from my mobile. Please excuse the typos.

On 2011-03-10, at 7:05 AM, "Lai  Will"  wrote:

> Hello,
>
> Currently I've been playing around with my single node cluster.
> I'm planning to test my code on a real cluster in the next few weeks.
>
> I've read some manuals on how to deploy it. It seems that still a lot has to 
> be done manually.
> As the cluster I will be working on will probably format their nodes quite 
> frequently, I'm looking for the simplest, mostly or even fully automatic 
> installation procedure that allows me to quickly reinstall hadoop on the 
> cluster.
>
> What are your approaches?
>
> Best,
> Will


Re: Question regardin the block size and the way that a block is used in Hadoop

2011-03-12 Thread James Seigel
Yes. Just your "FAT" (the namenode's block metadata) is consumed; the leftover space on disk stays free for the operating system to use.

Sent from my mobile. Please excuse the typos.

On 2011-03-12, at 11:04 AM, Florin Picioroaga  wrote:

> Hello!
>  I've been reading in the "Hadoop Definitive guide" by Tom White
> about the block emptiness when a file is not large enough to occupy the full 
> size of the block.
>
> From the statement (cite from the book)
> "Unlike a filesystem for a single disk, a file in HDFS that is smaller than a 
> single block does not occupy a full block’s worth of underlying storage." I 
> can understand the physical space left from the initial block size will be 
> free. My question is can the underlying operating reuse/write this remained 
> free space?
> I'll look forward for your answers.
> Thank you,
>  Florin
>
>
>
>


Re: Hadoop EC2 setup

2011-03-12 Thread James Seigel
Do you have the amazon tools installed and in the appropriate path?
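The contrib EC2 scripts shell out to the old ec2-* API tools (that is exactly what the missing ec2-describe-instances points at), so you need something like this in your environment before retrying, with the paths and key names as placeholders:

  export EC2_HOME=/opt/ec2-api-tools
  export PATH=$PATH:$EC2_HOME/bin
  export EC2_PRIVATE_KEY=~/.ec2/pk-XXXXXXXX.pem
  export EC2_CERT=~/.ec2/cert-XXXXXXXX.pem

ec2-describe-instances should work on its own before the launch script will.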

James

Sent from my mobile. Please excuse the typos.

On 2011-03-12, at 11:04 AM, JJ siung  wrote:

> Hi,
>
> I am following a setup guide here: http://wiki.apache.org/hadoop/AmazonEC2 but
> runs into problems when I tried to launch a cluster.
> An error message said
> "hadoop-0.21.0/common/src/contrib/ec2/bin/launch-hadoop-master: line 40:
> ec2-describe-instances: command not found"
> I am not even sure if I edited the hadoop-ec2-env.sh correctly. Is there any
> newer tutorial for setting this up?
>
> Thanks!


Re: YYC/Calgary/Alberta Hadoop Users?

2011-03-16 Thread James Seigel
Hello again.

I am guessing with the lack of response that there are either no hadoop people 
from Calgary, or they are afraid to meetup :)

How about just speaking up if you use hadoop in Calgary :)

Cheers
James.
\
On 2011-03-07, at 8:40 PM, James Seigel wrote:

> Hello,
> 
> Just wondering if there are any YYC hadoop users in the crowd and if
> there is any interest in a meetup of any sort?
> 
> Cheers
> James Seigel
> Captain Hammer
> Tynt



Re: Cloudera Flume

2011-03-16 Thread James Seigel
I believe sir there should be a flume support group on cloudera. I'm
guessing most of us here haven't used it and therefore aren't  much
help.

This is vanilla hadoop land. :)

Cheers and good luck!
James

On a side note, how much data are you pumping through it?


Sent from my mobile. Please excuse the typos.

On 2011-03-16, at 7:53 PM, Mark  wrote:

> Sorry if this is not the correct list to post this on, it was the closest I 
> could find.
>
> We are using a taildir('/var/log/foo/') source on all of our agents. If this 
> agent goes down and data can not be sent to the collector for some time, what 
> happens when this agent becomes available again? Will the agent tail the 
> whole directory starting from the beginning of all files thus adding 
> duplicate data to our sink?
>
> I've read that I could set the startFromEnd parameter to true. In that case 
> however if an agent goes down then we would lose any data that gets written 
> to our file until the agent comes back up. How do people handle this? It 
> seems like you either have to deal with the fact that you will have duplicate 
> or missing data.
>
> Thanks||


Re: Hadoop Testing?

2011-03-17 Thread James Seigel
Anandkumar

Count the records going in. See if the same number shows up in HDFS.

Number the records going in. See if the sequential set of numbers shows up in HDFS.

Other than that I am not sure what kinds of things you’d like to test.

Scaling:  What kind of scaling?  Size?  requests/sec?  data in / sec..

reliability?  What kinds of measures are you thinking of?  With replication set 
to 3, in the most unlucky scenario you can have two machines go down and you 
still won’t lose data.

Performance of MapReduce: a lot of people run teragen/terasort to get a baseline of how their changes affect a cluster (more nodes, configuration tweaks, etc...).
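The usual incantation is something along these lines (paths and sizes are just examples; teragen's first argument is the number of 100-byte rows):

  hadoop jar $HADOOP_HOME/hadoop-*-examples.jar teragen 10000000 /bench/tera-in
  hadoop jar $HADOOP_HOME/hadoop-*-examples.jar terasort /bench/tera-in /bench/tera-out
  hadoop jar $HADOOP_HOME/hadoop-*-examples.jar teravalidate /bench/tera-out /bench/tera-report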

I hope this helps a little.

Cheers
James
On 2011-03-17, at 11:48 AM, Anandkumar R wrote:

> Dear Friends,
> 
> I am Anandkumar, working as a test engineer in eBay and we use Hadoop
> extensively to store our log. I am in situation to validate or test our Data
> perfectly reaching the Hadoop infrastructure or not. could anyone of you
> recommend me the best testing methodologies and if there is any existing
> framework for testing Hadoop, please recommend to me.
> 
> My scenario is simple of Client will dump millions of data to Hadoop, I need
> to validate that the data has reached Hadoop perfectly and also there is not
> Data loss and also other testing like scalability and reliability.
> 
> Anticipating your support
> 
> Thanks,
> Anandkumar



Re: decommissioning node woes

2011-03-18 Thread James Seigel
Just a note.  If you just shut the node off, the blocks will replicate faster.

James.


On 2011-03-18, at 10:03 AM, Ted Dunning wrote:

> If nobody else more qualified is willing to jump in, I can at least provide
> some pointers.
> 
> What you describe is a bit surprising.  I have zero experience with any 0.21
> version, but decommissioning was working well
> in much older versions, so this would be a surprising regression.
> 
> The observations you have aren't all inconsistent with how decommissioning
> should work.  The fact that your nodes look up
> after starting the decommissioning isn't so strange.  The idea is that no
> new data will be put on the node, nor should it be
> counted as a replica, but it will help in reading data.
> 
> So that isn't such a big worry.
> 
> The fact that it takes forever and a day, however, is a big worry.  I cannot
> provide any help there just off hand.
> 
> What happens when a datanode goes down?  Do you see under-replicated files?
> Does the number of such files decrease over time?
> 
> On Fri, Mar 18, 2011 at 4:23 AM, Rita  wrote:
> 
>> Any help?
>> 
>> 
>> On Wed, Mar 16, 2011 at 9:36 PM, Rita  wrote:
>> 
>>> Hello,
>>> 
>>> I have been struggling with decommissioning data  nodes. I have a 50+
>> data
>>> node cluster (no MR) with each server holding about 2TB of storage. I
>> split
>>> the nodes into 2 racks.
>>> 
>>> 
>>> I edit the 'exclude' file and then do a -refreshNodes. I see the node
>>> immediate in 'Decommiosied node' and I also see it as a 'live' node!
>>> Eventhough I wait 24+ hours its still like this. I am suspecting its a
>> bug
>>> in my version.  The data node process is still running on the node I am
>>> trying to decommission. So, sometimes I kill -9 the process and I see the
>>> 'under replicated' blocks...this can't be the normal procedure.
>>> 
>>> There were even times that I had corrupt blocks because I was impatient
>> --
>>> waited 24-34 hours
>>> 
>>> I am using 23 August, 2010: release 0.21.0 <
>> http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available
>>> 
>>> version.
>>> 
>>> Is this a known bug? Is there anything else I need to do to decommission
>> a
>>> node?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> --- Get your facts first, then you can distort them as you please.--
>>> 
>> 
>> 
>> 
>> --
>> --- Get your facts first, then you can distort them as you please.--
>> 



Re: decommissioning node woes

2011-03-18 Thread James Seigel
I agree.

J
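For reference, the knob Michael mentions below lives in hdfs-site.xml and is in bytes per second (10 MB/s here is only an example; the stock default is much lower):

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>10485760</value>
  </property>

and, as he says, the datanodes need a restart to pick it up.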

On 2011-03-18, at 11:34 AM, Ted Dunning wrote:

> I like to keep that rather high.  If I am decommissioning nodes, I generally
> want them out of the cluster NOW.
> 
> That is probably a personality defect on my part.
> 
> On Fri, Mar 18, 2011 at 9:59 AM, Michael Segel 
> wrote:
> 
>> Once you see those blocks successfully replicated... you can take down the
>> next.
>> 
>> Is it clean? No, not really.
>> Is it dangerous? No, not really.
>> Do I recommend it? No, but its a quick and dirty way of doing things...
>> 
>> Or you can up your dfs.balance.bandwidthPerSec in the configuration files.
>> The default is pretty low.
>> 
>> The downside is that you have to bounce the cloud to get this value
>> updated, and it could have a negative impact on performance if set too high.
>> 



Re: Backupnode web UI showing upgrade status..

2011-03-22 Thread James Seigel
There is a step which flips a bit: "finalizeUpgrade", or something like that, needs to be run.

Should be straightforward.
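If memory serves it is just

  hadoop dfsadmin -finalizeUpgrade

run as the HDFS superuser once you are happy the new version is healthy; hadoop dfsadmin -upgradeProgress status will show you the current state first.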

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2011-03-22, at 7:32 AM, Gokulakannan M  wrote:

> Hi all,
>
> A newbie question reg backupnode . I just started the namenode and
> backupnode and in backupnode web UI it shows "Upgrade for version -24 has
> been completed. Upgrade is not finalized". I did not run any upgrade. Can
> anyone please clarify as this message is confusing..
>
>
>
> Thanks,
>
>  Gokul
>
>
>


Re: Is there any way to add jar when invoking hadoop command

2011-03-22 Thread James Seigel
Hello, some quick advice for you


Which portion of your job needs the jar? If the answer is the mapper or reducer, then add it to the -libjars flag.

If it is needed in the job initiation, bundle it in your job jar for fun.

Cheers
James.
On 2011-03-22, at 7:35 PM, Jeff Zhang wrote:

> Another work around I can think of is that have my own copy of hadoop, and
> copy extra jars to my hadoop. But it result into more maintenance effort
> 
> On Wed, Mar 23, 2011 at 9:19 AM, Jeff Zhang  wrote:
> 
>> Hi all,
>> 
>> When I use command "hadoop fs -text" I need to add extra jar to CLASSPATH,
>> because there's custom type in my sequence file. One way is that copying jar
>> to $HADOOP_HOME/lib
>> But in my case, I am not administrator, so I do not have the permission to
>> copy files under $HADOOP_HOME/lib
>> 
>> Is there any other ways to add extra jar to CLASSPATH ?
>> 
>> 
>> --
>> Best Regards
>> 
>> Jeff Zhang
>> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang



Re: Is there any way to add jar when invoking hadoop command

2011-03-22 Thread James Seigel
Ah, -libjars should do... let me see if I can dig up an example.


hadoop fs -libjars ./adhoc/x.jar -text 
//xxx/xxx//20110101/part-r-1.evt
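If the generic -libjars option doesn't end up on the client-side classpath for you, another route that needs no write access to $HADOOP_HOME/lib is the HADOOP_CLASSPATH environment variable, e.g.

  export HADOOP_CLASSPATH=./adhoc/x.jar
  hadoop fs -text //xxx/xxx//20110101/part-r-1.evt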


cheers
James.



On 2011-03-22, at 8:01 PM, Jeff Zhang wrote:

> Actually I don't use the jar for mapreduce job. I only need it to display
> sequence file.
> 
> 
> 
> On Wed, Mar 23, 2011 at 9:41 AM, James Seigel  wrote:
> 
>> Hello, some quick advice for you
>> 
>> 
>> which portion of your job needs the jar?  if answer = mapper or reducer,
>> then add it to the -libjars flag.
>> 
>> If it is in the job initiation..bundle it in your job jar for fun.
>> 
>> Cheers
>> James.
>> On 2011-03-22, at 7:35 PM, Jeff Zhang wrote:
>> 
>>> Another work around I can think of is that have my own copy of hadoop,
>> and
>>> copy extra jars to my hadoop. But it result into more maintenance effort
>>> 
>>> On Wed, Mar 23, 2011 at 9:19 AM, Jeff Zhang  wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> When I use command "hadoop fs -text" I need to add extra jar to
>> CLASSPATH,
>>>> because there's custom type in my sequence file. One way is that copying
>> jar
>>>> to $HADOOP_HOME/lib
>>>> But in my case, I am not administrator, so I do not have the permission
>> to
>>>> copy files under $HADOOP_HOME/lib
>>>> 
>>>> Is there any other ways to add extra jar to CLASSPATH ?
>>>> 
>>>> 
>>>> --
>>>> Best Regards
>>>> 
>>>> Jeff Zhang
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Best Regards
>>> 
>>> Jeff Zhang
>> 
>> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang



Re: CDH and Hadoop

2011-03-23 Thread James Seigel
If you are using one of the supported platforms, then it is easy to get up and 
going fairly quickly as well.

...advice from another seigel/segel

Cheers
james.


On 2011-03-23, at 9:32 AM, Michael Segel wrote:

> 
> Rita,
> 
> It sounds like you're only using Hadoop and have no intentions to really get 
> into the internals.
> 
> I'm like most admins/developers/IT guys and I'm pretty lazy.
> I find it easier to set up the yum repository and then issue the yum install 
> hadoop command. 
> 
> The thing about Cloudera is that they do back port patches so that while 
> their release is 'heavily patched'.
> But they are usually in some sort of sync with the Apache release. Since 
> you're only working with HDFS and its pretty stable, I'd say go with the 
> Cloudera release.
> 
> HTH
> 
> -Mike
> 
> 
> 
>> Date: Wed, 23 Mar 2011 11:12:30 -0400
>> Subject: Re: CDH and Hadoop
>> From: rmorgan...@gmail.com
>> To: common-user@hadoop.apache.org
>> CC: michael_se...@hotmail.com
>> 
>> Mike,
>> 
>> Thanks. This helps a lot.
>> 
>> At our lab we have close to 60 servers which only run hdfs. I don't need
>> mapreduce and other bells and whistles. We just use hdfs for storing dataset
>> results ranging from 3gb to 90gb.
>> 
>> So, what is the best practice for hdfs? should I always deploy one version
>> before? I understand that Cloudera's version is heavily patched (similar to
>> Redhat Linux kernel versus standard Linux kernel).
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Mar 23, 2011 at 10:44 AM, Michael Segel
>> wrote:
>> 
>>> 
>>> Rita,
>>> 
>>> Short answer...
>>> 
>>> Cloudera's release is free, and they do also offer a support contract if
>>> you want support from them.
>>> Cloudera has sources, but most use yum (redhat/centos) to download an
>>> already built release.
>>> 
>>> Should you use it?
>>> Depends on what you want to do.
>>> 
>>> If your goal is to get up and running with Hadoop and then focus on *using*
>>> Hadoop/HBase/Hive/Pig/etc... then it makes sense.
>>> 
>>> If your goal is to do a deep dive in to Hadoop and get your hands dirty
>>> mucking around with the latest and greatest in trunk? Then no. You're better
>>> off building your own off the official Apache release.
>>> 
>>> Many companies choose Cloudera's release for the following reasons:
>>> * Paid support is available.
>>> * Companies focus on using a tech not developing the tech, so Cloudera does
>>> the heavy lifting while Client Companies focus on 'USING' Hadoop.
>>> * Cloudera's release makes sure that the versions in the release work
>>> together. That is that when you down load CHD3B4, you get a version of
>>> Hadoop that will work with the included version of HBase, Hive, etc ...
>>> 
>>> And no, its never a good idea to try and mix and match Hadoop from
>>> different environments and versions in a cluster.
>>> (I think it will barf on you.)
>>> 
>>> Does that help?
>>> 
>>> -Mike
>>> 
>>> 
>>> 
 Date: Wed, 23 Mar 2011 10:29:16 -0400
 Subject: CDH and Hadoop
 From: rmorgan...@gmail.com
 To: common-user@hadoop.apache.org
 
 I have been wondering if I should use CDH (
>>> http://www.cloudera.com/hadoop/)
 instead of the standard Hadoop distribution.
 
 What do most people use? Is CDH free? do they provide the tars or does it
 provide source code and I simply compile? Can I have some data nodes as
>>> CDH
 and the rest as regular Hadoop?
 
 
 I am asking this because so far I noticed a serious bug (IMO) in the
 decommissioning process (
 
>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/%3cAANLkTikPKGt5zw1QGLse+LPzUDP7Mom=ty_mxfcuo...@mail.gmail.com%3e
 )
 
 
 
 
 --
 --- Get your facts first, then you can distort them as you please.--
>>> 
>>> 
>> 
>> 
>> 
>> --
>> --- Get your facts first, then you can distort them as you please.--
> 



Hadoop in Canada

2011-03-29 Thread James Seigel
Hello,

You might remember me from a couple of weeks back asking if there were any 
Calgary people interested in a “meetup” about #bigdata or using hadoop.  Well, 
I’ve expanded my search a little to see if any of my Canadian brothers and 
sisters are using the elephant for good or for evil.  It might be harder to 
grab coffee, but it would be fun to see where everyone is.

Shout out if you’d like or ping me, I think it’d be fun to chat!

Cheers
James Seigel
Captain Hammer at Tynt.com

Re: Hadoop in Canada

2011-03-29 Thread James Seigel
I apologize for posting it to the user list.

Sorry. Forgot about General

Thanks J-D

James.
On 2011-03-29, at 11:33 AM, Jean-Daniel Cryans wrote:

> (moving to general@ since this is not a question regarding the usage
> of the hadoop commons, which I BCC'd)
> 
> I moved from Montreal to SF a year and a half ago because I saw two
> things 1) companies weren't interested (they are still trying to get
> rid of COBOL or worse) or didn't have the data to use Hadoop (not
> enough big companies) and 2) the universities were either uninterested
> or just amused by this "new comer". I know of one company that really
> does cool stuff with Hadoop in Montreal and it's Hopper
> (www.hopper.travel, they are still in closed alpha AFAIK) who also
> organized hackreduce.org last weekend. This is what their CEO has to
> say to the question "Is there something you would do differently now
> if you would start it over?":
> 
> Move to the Valley.
> 
> (see the rest here
> http://nextmontreal.com/product-market-fit-hopper-travel-fred-lalonde/)
> 
> I'm sure there are a lot of other companies that are either
> considering using or already using Hadoop to some extent in Canada
> but, like anything else, only a portion of them are interested in
> talking about it or even organizing an event.
> 
> I would actually love to see something getting organized and I'd be on
> the first plane to Y**, but I'm afraid that to achieve any sort of
> critical mass you'd have to fly in people from all the provinces. Air
> Canada becomes a SPOF :P
> 
> Now that I think about it, there's probably enough Canucks around here
> that use Hadoop that we could have our own little user group. If you
> want to have a nice vacation and geek out with us, feel free to stop
> by and say hi.
> 
> 
> 
> J-D
> 
> On Tue, Mar 29, 2011 at 6:21 AM, James Seigel  wrote:
>> Hello,
>> 
>> You might remember me from a couple of weeks back asking if there were any 
>> Calgary people interested in a “meetup” about #bigdata or using hadoop.  
>> Well, I’ve expanded my search a little to see if any of my Canadian brothers 
>> and sisters are using the elephant for good or for evil.  It might be harder 
>> to grab coffee, but it would be fun to see where everyone is.
>> 
>> Shout out if you’d like or ping me, I think it’d be fun to chat!
>> 
>> Cheers
>> James Seigel
>> Captain Hammer at Tynt.com



Re: Including Additional Jars

2011-04-04 Thread James Seigel
James’ quick and dirty, get your job running guideline:

-libjars <-- for jars you want accessible by the mappers and reducers
classpath or bundled in the main jar <-- for jars you want accessible to the 
runner
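And since the question below asks for a worked example of the distributed cache route, a minimal sketch (the HDFS path is made up, and the jar has to be copied up to HDFS first):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;

  // in the driver, before the Job is created
  Configuration conf = new Configuration();
  // ships a jar that is already sitting in HDFS to every task's classpath
  DistributedCache.addFileToClassPath(new Path("/libs/my-extra-lib.jar"), conf);
  Job job = new Job(conf, "my job");
  // ... normal job setup, then job.waitForCompletion(true) ...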

Cheers
James.



On 2011-04-04, at 12:31 PM, Shuja Rehman wrote:

> well...i think to put in distributed cache is good idea. do u have any
> working example how to put extra jars in distributed cache and how to make
> available these jars for job?
> Thanks
> 
> On Mon, Apr 4, 2011 at 10:20 PM, Mark Kerzner  wrote:
> 
>> I think you can put them either in your jar or in distributed cache.
>> 
>> As Allen pointed out, my idea of putting them into hadoop lib jar was
>> wrong.
>> 
>> Mark
>> 
>> On Mon, Apr 4, 2011 at 12:16 PM, Marco Didonna >> wrote:
>> 
>>> On 04/04/2011 07:06 PM, Allen Wittenauer wrote:
>>> 
 
 On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote:
 
 Hi All
> 
> I have created a map reduce job and to run on it on the cluster, i have
> bundled all jars(hadoop, hbase etc) into single jar which increases the
> size
> of overall file. During the development process, i need to copy again
>> and
> again this complete file which is very time consuming so is there any
>> way
> that i just copy the program jar only and do not need to copy the lib
> files
> again and again. i am using net beans to develop the program.
> 
> kindly let me know how to solve this issue?
> 
 
   This was in the FAQ, but in a non-obvious place.  I've updated it
 to be more visible (hopefully):
 
 
 
>> http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F
 
>>> 
>>> Does the same apply to jar containing libraries? Let's suppose I need
>>> lucene-core.jar to run my project. Can I put my this jar into my job jar
>> and
>>> have hadoop "see" lucene's classes? Or should I use distributed cache??
>>> 
>>> MD
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> 



Re: Error while using distcp

2011-04-18 Thread James Seigel
Same versions of hadoop in each cluster?
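Quick way to check: run hadoop version on a node in each cluster and compare. If they do differ, the usual recommendation is to run distcp on the destination cluster and read from the source over hftp:// (which is the direction you are already going), since hftp is read-only and tolerant of version differences.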

Sent from my mobile. Please excuse the typos.

On 2011-04-18, at 6:31 PM, sonia gehlot  wrote:

> Hi All,
>
> I am trying to copy files from one hadoop cluster to another hadoop cluster
> but I am getting following error:
>
> [phx1-rb-bi-dev50-metrics-qry1:]$ scripts/hadoop.sh distcp
> hftp://c17-dw-dev50-hdfs-dn-n1:50070/user/sgehlot/fact_lead.v0.txt.gz \
> hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot
> HADOOP_HOME: /home/sgehlot/cnwk-hadoop/hadoop/0.20.1
> HADOOP_CONF_DIR: /home/sgehlot/cnwk-hadoop/config/hadoop/0.20.1/conf_rb-dev
> *11/04/18 17:12:23 INFO tools.DistCp:
> srcPaths=[hftp://c17-dw-dev50-hdfs-dn-n1:50070/user/sgehlot/fact_lead.v0.txt.gz]
> 11/04/18 17:12:23 INFO tools.DistCp: destPath=hdfs://
> phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot
> [Fatal Error] :1:186: XML document structures must start and end within the
> same entity.
> With failures, global counters are inaccurate; consider running with -i
> Copy failed: java.io.IOException: invalid xml directory content
> *at
> org.apache.hadoop.hdfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:239)
>at
> org.apache.hadoop.hdfs.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:244)
>at
> org.apache.hadoop.hdfs.HftpFileSystem.getFileStatus(HftpFileSystem.java:273)
>at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:689)
>at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:621)
>at org.apache.hadoop.tools.DistCp.copy(DistCp.java:638)
>at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>at org.apache.hadoop.tools.DistCp.main(DistCp.java:884)
> Caused by: org.xml.sax.SAXParseException: XML document structures must start
> and end within the same entity.
>at
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1231)
>at
> org.apache.hadoop.hdfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:233)
>... 9 more
>
> Any idea why I am getting this.
>
> Thanks,
> Sonia


Re: HDFS permission denied

2011-04-24 Thread James Seigel
Check where the hadoop tmp setting is pointing to.
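The stock default in core-site.xml looks like this, and the mapred system and staging areas hang off it unless overridden, so it is worth checking that the directories under it (both local and in HDFS) are writable by the user submitting the job:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>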

James

Sent from my mobile. Please excuse the typos.

On 2011-04-24, at 12:41 AM, "Peng, Wei"  wrote:

> Hi,
>
>
>
> I need a help very bad.
>
>
>
> I got an HDFS permission error by starting to run hadoop job
>
> org.apache.hadoop.security.AccessControlException: Permission denied:
>
> user=wp, access=WRITE, inode="":hadoop:supergroup:rwxr-xr-x
>
>
>
> I have the right permission to read and write files to my own hadoop
> user directory.
>
> It works fine when I use hadoop fs -put. The job input and output are
> all from my own hadoop user directory.
>
>
>
> It seems that when a job starts running, some data need to be written
> into some directory, and I don't have the permission to that directory.
> It is strange that the inode does not show which directory it is.
>
>
>
> Why does hadoop write something to a directory with my name secretly? Do
> I need to be set a particular user group?
>
>
>
> Many Thanks..
>
>
>
> Vivian
>
>
>
>
>


Re: Fixing a bad HD

2011-04-25 Thread James Seigel
Quicker:

Shut off power
Throw hard drive out, put new one in
Turn power back on.

Sent from my mobile. Please excuse the typos.

On 2011-04-25, at 5:38 PM, Mayuran Yogarajah
 wrote:

> Hello,
>
> One of our nodes has a bad hard disk which needs to be replaced.  I'm 
> planning on doing the following:
> 1) Decommission the node
> 2) Replace the disk
> 3) Bring the node back into the cluster
>
> Is there a quicker/better way to address this? Please advise.
>
> thanks,
> M


Re: Fixing a bad HD

2011-04-25 Thread James Seigel
Good point. Advice without details can be tough.

Additional notes:  make sure you have three replicas and the blocks
are replicated. :)
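A quick check before pulling the drive:

  hadoop fsck / | grep -i 'under-replicated'

The summary should show 0 under-replicated blocks if it is safe to have the node disappear for a bit.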

Sent from my mobile. Please excuse the typos.

On 2011-04-25, at 7:04 PM, Brian Bockelman  wrote:

> Much quicker, but less safe: data might become inaccessible between boots if 
> you simultaneously lose another node.  Probably not an issue at 3 replicas, 
> but definitely an issue at 2.
>
> Brian
>
> On Apr 25, 2011, at 7:58 PM, James Seigel wrote:
>
>> Quicker:
>>
>> Shut off power
>> Throw hard drive out put new one in
>> Turn power back on.
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2011-04-25, at 5:38 PM, Mayuran Yogarajah
>>  wrote:
>>
>>> Hello,
>>>
>>> One of our nodes has a bad hard disk which needs to be replaced.  I'm 
>>> planning on doing the following:
>>> 1) Decommission the node
>>> 2) Replace the disk
>>> 3) Bring the node back into the cluster
>>>
>>> Is there a quicker/better way to address this? Please advise.
>>>
>>> thanks,
>>> M
>


Re: fair scheduler issue

2011-04-26 Thread James Seigel
I know cloudera has a bug in their version. They should have filed a
Jira for it.

Are you getting NPE in the logs?

James

Sent from my mobile. Please excuse the typos.

On 2011-04-26, at 6:53 AM, Saurabh bhutyani  wrote:

> Which version of hadoop are you referring to?
>
> Thanks & Regards,
> Saurabh Bhutyani
>
> Call  : 9820083104
> Gtalk: s4saur...@gmail.com
>
>
>
> On Tue, Apr 26, 2011 at 5:59 AM, hadoopman  wrote:
>
>> Has anyone had problems with the latest version of hadoop and the fair
>> scheduler not placing jobs into pools correctly?  We're digging into it
>> currently.  An older version of hadoop (using our config file) is working
>> fine however the latest version seems to be putting everything into the
>> default pool.
>>
>> Thoughts on where we can look?
>>
>> Thanks!
>>
>>


Re: -copyFromLocal , -put commands failing

2011-04-27 Thread James Seigel
Liliana,

Don’t worry so is Eric ;)

James.
On 2011-04-27, at 8:50 AM, Liliana Mamani Sanchez wrote:

> Hi Eric
> 
> You are right :S
> You must have realised I'm a complete hadoop newbie.
> 
> Cheers
> 
> On Wed, Apr 27, 2011 at 2:35 PM, Eric Fiala  wrote:
>> Liliana,
>> I might be completely off base, but this output from your hadoop fs -ls
>> appears to be directories - do you have anything inside these directories
>> (try hadoop dfs -lsr /user/hadoop/jojolete to list recursively)
>> 
>> EF
>> 
>> On 27 April 2011 07:25, Liliana Mamani Sanchez  wrote:
>> 
>>> Hello all,
>>> 
>>> Actually, I realised about this problem when trying to use Mahout,
>>> trying to create vectors using the
>>> 
>>> $MAHOUT_HOME/bin/mahout seqdirectory
>>> 
>>> in my case:
>>> 
>>> $MAHOUT_HOME/bin/mahout seqdirectory -i
>>> /home/hadoop/programming/docs-test/ -o  jojolete --charset ascii
>>> 
>>> command. This was not producing any output and after trying a lot to
>>> find the reason, I thought it was because I had first to copy the
>>> files to hadoop. Then I issued
>>> 
>>> 
>>> dfs -copyFromLocal ../../programming/docs-test docs-test-maho
>>> 
>>> and I realised it "copies" the files but to zero byte files and I even
>>> spotted the "jojolete" output from mahout command and this is zero
>>> sized as well
>>> 
>>> 
>>> $HADOOP_HOME/bin/hadoop dfs -ls
>>> Found 5 items
>>> drwxr-xr-x   - hadoop supergroup  0 2011-04-27 11:52
>>> /user/hadoop/docs-test-mah
>>> drwxr-xr-x   - hadoop supergroup  0 2011-04-27 12:00
>>> /user/hadoop/docs-test-maho
>>> drwxr-xr-x   - hadoop supergroup  0 2011-04-27 13:30
>>> /user/hadoop/gutenberg
>>> drwxr-xr-x   - hadoop supergroup  0 2011-04-26 17:46
>>> /user/hadoop/jojolete
>>> drwxr-xr-x   - hadoop supergroup  0 2011-04-26 14:26
>>> /user/hadoop/testing-output
>>> 
>>> I don't know if I'm doing something wrong, the commands never return
>>> any failure message. Maybe some of you have had a similar experience.
>>> I hope you can help me
>>> 
>>> Cheers
>>> 
>>> Liliana
>>> 
>>> 
>>> 
>>> --
>>> Liliana Paola Mamani Sanchez
>>> 
>> 
> 
> 
> 
> -- 
> Liliana Paola Mamani Sanchez



Re: number of maps it lower than the cluster capacity

2011-05-01 Thread James Seigel
There is also an input split size thing in there as well. But definitely the # of
mappers is proportional to input data size.

Sent from my mobile. Please excuse the typos.

On 2011-05-01, at 11:26 AM, ShengChang Gu  wrote:

> If I'm not mistaken,then the map slots = input data size / block size.
>
> 2011/5/2 Guy Doulberg 
>
>> Hey,
>> Maybe someone can give an idea where to look for the bug...
>>
>> I have a cluster with 270 slots for mappers,
>> And a fairSchedualer configured for it
>>
>> Sometimes this cluster allocates only 80 or 50 slots to the entire cluster.
>> Most of the time the most of slots are allocated .
>>
>> I noticed that there are special several jobs that cause the above
>> behavior,
>>
>> Thanks Guy,
>>
>>
>>
>
>
> --
> 阿昌


Re: number of maps it lower than the cluster capacity

2011-05-02 Thread James Seigel
Maybe throw the output of the startup of your job on a gist and share
it and then we can maybe help some more.

The other thing to look at is how fast your mappers are completing.

If they are completing in under, let's say, 10 seconds, then the job
tracker is "thrashing" and not able to allocate mappers fast enough.
You could then adjust the input split size for your jobs a bit and
slow the mappers down so the jobtracker can keep up.
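For example, if the job uses FileInputFormat you can push the minimum split size up so each mapper gets more data (value in bytes; 256 MB is only an illustration, and the driver has to go through ToolRunner for -D to be honoured):

  hadoop jar my-job.jar com.example.MyDriver -D mapred.min.split.size=268435456 /input /output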

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 12:08 AM, Guy Doulberg  wrote:

> I know that...
> Map allocations for a job is proportional to size of input data,
> But in my case
> I have a job that runs with 200 map slots
> And a specific job that when it runs, the job tracker allocates to it 60 
> mappers (although I think is should get more according to my calculations) 
> and 30 to the job that was already running.
> What happened to the other 110 map slots? Why aren't they allocated?
>
>
> Thanks, Guy
>
>
>
> -Original Message-
> From: James Seigel [mailto:ja...@tynt.com]
> Sent: Sunday, May 01, 2011 9:42 PM
> To: common-user@hadoop.apache.org
> Subject: Re: number of maps it lower than the cluster capacity
>
> Also an input split size thing in there as well.  But definitely # of
> mappers are proportional to input data size
>
> Sent from my mobile. Please excuse the typos.
>
> On 2011-05-01, at 11:26 AM, ShengChang Gu  wrote:
>
>> If I'm not mistaken,then the map slots = input data size / block size.
>>
>> 2011/5/2 Guy Doulberg 
>>
>>> Hey,
>>> Maybe someone can give an idea where to look for the bug...
>>>
>>> I have a cluster with 270 slots for mappers,
>>> And a fairSchedualer configured for it
>>>
>>> Sometimes this cluster allocates only 80 or 50 slots to the entire cluster.
>>> Most of the time the most of slots are allocated .
>>>
>>> I noticed that there are special several jobs that cause the above
>>> behavior,
>>>
>>> Thanks Guy,
>>>
>>>
>>>
>>
>>
>> --
>> 阿昌


Re: HDFS - MapReduce coupling

2011-05-02 Thread James Seigel
If you are pressed for time, you could look at the source code.  I
believe a huge proportion of the people that could answer your
question ( and it isn't a small one ) are sleeping right now. :)

Source code is probably your best answer.

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 5:23 AM, Matthew John  wrote:

> someone kindly give some pointers on this!!
>
> On Mon, May 2, 2011 at 12:46 PM, Matthew John 
> wrote:
>
>> Any documentations on how the different daemons do the write/read on HDFS
>> and Local File System (direct), I mean the different protocols used in the
>> interactions. I basically wanted to figure out how intricate the coupling
>> between the Storage (HDFS + Local) and other processes in the Hadoop
>> infrastructure is.
>>
>>
>>
>> On Mon, May 2, 2011 at 12:26 PM, Ted Dunning wrote:
>>
>>> Yes.  There is quite a bit of need for the local file system in clustered
>>> mode.
>>>
>>> For one think, all of the shuffle intermediate files are on local disk.
>>> For
>>> another, the distributed cache is actually stored on local disk.
>>>
>>> HFDS is a frail vessel that cannot cope with all the needs.
>>>
>>> On Sun, May 1, 2011 at 11:48 PM, Matthew John >>> wrote:
>>>
 ...
 2) Does the Hadoop system utilize the local storage directly for any
 purpose
 (without going through the HDFS) in clustered mode?


>>>
>>
>>


Re: so many failures on reducers.

2011-05-02 Thread James Seigel
What are your permissions on your hadoop.tmp.dir ?

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 1:26 AM, Jun Young Kim  wrote:

> hi, all.
>
> I got so many failures on a reducing step.
>
> see this error.
>
> java.io.IOException: Failed to delete earlier output of task: 
> attempt_201105021341_0021_r_01_0
>at 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:157)
>at 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
>at 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:173)
>at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:133)
>at 
> org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:233)
>at org.apache.hadoop.mapred.Task.commit(Task.java:962)
>at org.apache.hadoop.mapred.Task.done(Task.java:824)
>at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
>at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
>at org.apache.hadoop.mapred.C
>
>
> this error was happened after adopting MultipleTextOutputFormat class in my 
> job.
> the job is producing thousands of different output files on a HDFS.
>
> anybody can guess reasons?
>
> thanks.
>
> --
> Junyoung Kim (juneng...@gmail.com)
>


Re: Configuration for small Cluster

2011-05-02 Thread James Seigel
Do you see swapping on your data nodes with this config?

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 5:38 AM, baran cakici  wrote:

> any comments???
>
> 2011/4/28 baran cakici 
>
>> Hi Everyone,
>>
>> I have a Cluster with one Master(JobTracker and NameNode - Intel Core2Duo 2
>> GB Ram) and four Slaves(Datanode and Tasktracker - Celeron 2 GB Ram). My
>> Inputdata are between 2GB-10GB and I read Inputdata in MapReduce line by
>> line. Now, I try to accelerate my System(Benchmark), but I'm not sure, if my
>> Configuration is correctly. Can you please just look, if it is ok?
>>
>> -mapred-site.xml
>>
>> <property>
>>   <name>mapred.job.tracker</name>
>>   <value>apple:9001</value>
>> </property>
>>
>> <property>
>>   <name>mapred.child.java.opts</name>
>>   <value>-Xmx512m -server</value>
>> </property>
>>
>> <property>
>>   <name>mapred.job.tracker.handler.count</name>
>>   <value>2</value>
>> </property>
>>
>> <property>
>>   <name>mapred.local.dir</name>
>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/local</value>
>> </property>
>>
>> <property>
>>   <name>mapred.map.tasks</name>
>>   <value>1</value>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.tasks</name>
>>   <value>4</value>
>> </property>
>>
>> <property>
>>   <name>mapred.submit.replication</name>
>>   <value>2</value>
>> </property>
>>
>> <property>
>>   <name>mapred.system.dir</name>
>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system</value>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.indexcache.mb</name>
>>   <value>10</value>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>1</value>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>4</value>
>> </property>
>>
>> <property>
>>   <name>mapred.temp.dir</name>
>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/temp</value>
>> </property>
>>
>> <property>
>>   <name>webinterface.private.actions</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.slowstart.completed.maps</name>
>>   <value>0.01</value>
>> </property>
>>
>> -hdfs-site.xml
>>
>> <property>
>>   <name>dfs.block.size</name>
>>   <value>268435456</value>
>> </property>
>> PS: I increased dfs.block.size because it gave me 50% better performance.
>>
>> I am waiting for your comments...
>>
>> Regards,
>>
>> Baran
>>


Re: Configuration for small Cluster

2011-05-02 Thread James Seigel
I am talking about Unix swapping memory out to disk when the OS runs
low on RAM.  If it is doing this you will get abysmal performance.

With the small amount of RAM in your boxes, and your map and reduce
slots totaling 5 with child opts of 512 MB each (roughly 2.5 GB of
child heap alone on a 2 GB machine), you have a scenario where you can
OOM the box.

To be fair I didn't go too deep into your settings.

James

Sent from my mobile. Please excuse the typos.
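
(A hedged sketch of what acting on this might look like; the values below are illustrative only, not a recommendation from the thread. The idea is to keep the total child heap, slots times -Xmx, well under physical RAM: 1 map slot and 2 reduce slots at 256 MB each is about 768 MB of child heap, leaving headroom for the DataNode and TaskTracker daemons on a 2 GB node.)

<!-- Hypothetical mapred-site.xml values for a 2 GB worker node; tune as needed. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx256m</value>
</property>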

On 2011-05-02, at 6:58 AM, baran cakici  wrote:

> Hi James,
>
> Thank you for your response... What do you mean by "swapping"? Each node
> works as a TaskTracker and DataNode - I can see that, if that is what you
> are asking...
> Regards,
>
> Baran
> 2011/5/2 James Seigel 
>
>> Do you see swapping on your data nodes with this config?
>>
>> James
>>
>> Sent from my mobile. Please excuse the typos.
>>
>> On 2011-05-02, at 5:38 AM, baran cakici  wrote:
>>
>>> any comments???
>>>
>>> 2011/4/28 baran cakici 
>>>
>>>> Hi Everyone,
>>>>
>>>> I have a cluster with one master (JobTracker and NameNode - Intel Core2Duo,
>>>> 2 GB RAM) and four slaves (DataNode and TaskTracker - Celeron, 2 GB RAM). My
>>>> input data is between 2 GB and 10 GB, and I read it in MapReduce line by
>>>> line. Now I am trying to speed up my system (benchmark), but I'm not sure if
>>>> my configuration is correct. Can you please check whether it is OK?
>>>>
>>>> -mapred-site.xml
>>>>
>>>> <property>
>>>>   <name>mapred.job.tracker</name>
>>>>   <value>apple:9001</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.child.java.opts</name>
>>>>   <value>-Xmx512m -server</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.job.tracker.handler.count</name>
>>>>   <value>2</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.local.dir</name>
>>>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/local</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.map.tasks</name>
>>>>   <value>1</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.reduce.tasks</name>
>>>>   <value>4</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.submit.replication</name>
>>>>   <value>2</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.system.dir</name>
>>>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.indexcache.mb</name>
>>>>   <value>10</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>   <value>1</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>   <value>4</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.temp.dir</name>
>>>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/temp</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>webinterface.private.actions</name>
>>>>   <value>true</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.reduce.slowstart.completed.maps</name>
>>>>   <value>0.01</value>
>>>> </property>
>>>>
>>>> -hdfs-site.xml
>>>>
>>>> <property>
>>>>   <name>dfs.block.size</name>
>>>>   <value>268435456</value>
>>>> </property>
>>>> PS: I increased dfs.block.size because it gave me 50% better performance.
>>>>
>>>> I am waiting for your comments...
>>>>
>>>> Regards,
>>>>
>>>> Baran
>>>>
>>


Re: Anyone using layer 2 vlans in their cluster?

2011-05-02 Thread James Seigel
I think we have, Mr. Segel. I could dig more with the IT guys if you're
interested.

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 8:03 AM, Michael Segel  wrote:

>
> Hi,
>
> This may be a bit esoteric but I was thinking about clusters and cluster 
> growth within a machine room.
>
> It may not be possible to get contiguous rack space for a very large cluster. 
> I was wondering if anyone had built out a large cluster and had to implement 
> a VLAN (layer 2 virtual LAN) to make the cluster 'look and feel' as if the 
> machines were racked and co-located together?
>
> Thx
>
> -Mike
>
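
(A hedged configuration note on the rack question; the script path below is a placeholder. If racks end up scattered around the machine room behind a stretched layer-2 VLAN, a topology script can still tell Hadoop the physical rack of each node, so HDFS replica placement and task scheduling keep respecting the real layout.)

<!-- Hypothetical core-site.xml entry; the script path is a placeholder. -->
<!-- The script maps datanode hostnames/IPs to rack IDs such as /dc1/rack12. -->
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>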


Re: Configuration for small Cluster

2011-05-02 Thread James Seigel
Sorry, I assumed Linux.

James

Sent from my mobile. Please excuse the typos.

On 2011-05-02, at 8:15 AM, Richard Nadeau  wrote:

> Are you running under Cygwin on your data nodes as well? That is certain to
> cause performance problems. As James suggested, swapping to disk is going to
> be a killer, and running on Windows with Celeron processors only compounds
> the problem. The Celeron processor is also sub-optimal for CPU-intensive
> tasks.
>
> Rick
>
> On Apr 28, 2011 9:22 AM, "baran cakici"  wrote:
>> Hi Everyone,
>>
>> I have a cluster with one master (JobTracker and NameNode - Intel Core2Duo,
>> 2 GB RAM) and four slaves (DataNode and TaskTracker - Celeron, 2 GB RAM). My
>> input data is between 2 GB and 10 GB, and I read it in MapReduce line by
>> line. Now I am trying to speed up my system (benchmark), but I'm not sure if
>> my configuration is correct. Can you please check whether it is OK?
>>
>> -mapred-site.xml
>>
>> <property>
>>   <name>mapred.job.tracker</name>
>>   <value>apple:9001</value>
>> </property>
>>
>> <property>
>>   <name>mapred.child.java.opts</name>
>>   <value>-Xmx512m -server</value>
>> </property>
>>
>> <property>
>>   <name>mapred.job.tracker.handler.count</name>
>>   <value>2</value>
>> </property>
>>
>> <property>
>>   <name>mapred.local.dir</name>
>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/local</value>
>> </property>
>>
>> <property>
>>   <name>mapred.map.tasks</name>
>>   <value>1</value>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.tasks</name>
>>   <value>4</value>
>> </property>
>>
>> <property>
>>   <name>mapred.submit.replication</name>
>>   <value>2</value>
>> </property>
>>
>> <property>
>>   <name>mapred.system.dir</name>
>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system</value>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.indexcache.mb</name>
>>   <value>10</value>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>1</value>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>4</value>
>> </property>
>>
>> <property>
>>   <name>mapred.temp.dir</name>
>>   <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/temp</value>
>> </property>
>>
>> <property>
>>   <name>webinterface.private.actions</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.slowstart.completed.maps</name>
>>   <value>0.01</value>
>> </property>
>>
>> -hdfs-site.xml
>>
>> <property>
>>   <name>dfs.block.size</name>
>>   <value>268435456</value>
>> </property>
>> PS: I increased dfs.block.size because it gave me 50% better performance.
>>
>> I am waiting for your comments...
>>
>> Regards,
>>
>> Baran

