Automate Hadoop installation

2011-12-05 Thread praveenesh kumar
Hi all, Can anyone guide me on how to automate the Hadoop installation/configuration process? I want to install Hadoop on 10-20 nodes, which may even grow to 50-100 nodes. I know we can use configuration tools like Puppet or shell scripts. Has anyone done it? How can we do Hadoop installation...

RE: Automate Hadoop installation

2011-12-05 Thread Sagar Shukla
Hi Praveenesh, I had created VM images with the OS / Hadoop nodes pre-configured, which I would start as per requirement. But if you plan to do it at the hardware level, then Linux provides kickstart-type configuration, which allows OS / package installations to run automatically (network configuration...

Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion

2011-12-05 Thread Jignesh Patel
I am running the 64-bit version. Have you set up SSH properly? On Dec 3, 2011, at 2:30 AM, Will L wrote: > I am using 64-bit Eclipse 3.7.1 Cocoa with Hadoop 0.20.205.0. I get the > following error message: > An internal error occurred during: "Connecting to DFS localhost". > org/apache/commons/co...

Multiple Mappers for Multiple Tables

2011-12-05 Thread Justin Vincent
I would like to join some DB tables, possibly from different databases, in an MR job. I would essentially like to use MultipleInputs, but that seems file-oriented. I need a different mapper for each DB table. Suggestions? Thanks! Justin Vincent

Re: Automate Hadoop installation

2011-12-05 Thread Konstantin Boudnik
There is this great project called BigTop (in the Apache Incubator) which provides for building the Hadoop stack. Part of what it provides is a set of Puppet recipes which will allow you to do exactly what you're looking for, with perhaps some minor corrections. Seriously, look at Puppet - otherwise...

Hadoop Profiling

2011-12-05 Thread Bai Shen
I turned on profiling in Hadoop, and the MapReduce tutorial at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html says that the profile files should go to the user log directory. However, they're currently going to the working directory I start the Hadoop job from. I've se...
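For reference, the same knobs can also be set from the job code. Here is a minimal sketch against the 0.20-era JobConf API; the rest of the job setup is omitted, and the HPROF options shown are just the stock defaults:

    import org.apache.hadoop.mapred.JobConf;

    public class ProfiledJobSetup {
      // Enables task profiling on an existing JobConf; a sketch only -
      // mapper, reducer, and input/output paths are omitted.
      public static JobConf withProfiling(JobConf conf) {
        conf.setProfileEnabled(true);
        conf.setProfileTaskRange(true, "0-2");   // profile map tasks 0-2
        conf.setProfileTaskRange(false, "0-2");  // profile reduce tasks 0-2
        // HPROF agent options; the framework substitutes %s with the output
        // file name, which should land under the task's userlogs directory.
        conf.setProfileParams(
            "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");
        return conf;
      }
    }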

Re: Multiple Mappers for Multiple Tables

2011-12-05 Thread Bejoy Ks
Justin, if I get your requirement right, you need to pull in data from multiple RDBMS sources and do a join on the same, and maybe some more custom operations on top of that. For this you don't need to go in for writing custom MapReduce code unless it is really required. You can achieve t...

Pig Output

2011-12-05 Thread Aaron Griffith
Using PigStorage(), my Pig script output gets put into partial files on the Hadoop file system. When I use the copyToLocal function from Hadoop, it creates a local directory with all the partial files. Is there a way to copy the partial files from Hadoop into a single local file? Thanks

Re: Multiple Mappers for Multiple Tables

2011-12-05 Thread Bejoy Ks
Hi Justin, Just to add on to my response: if you need to fetch data from an RDBMS in your mapper using custom MapReduce code, you can use DBInputFormat in your mapper class with MultipleInputs. You have to be careful with the number of mappers for your application, as DBs would...
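For reference, a minimal sketch of the plain DBInputFormat side of this, against the old-style mapred API of the 0.20 line. The JDBC driver, connection string, "orders" table, and column names are all hypothetical:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class OrdersJob {

      // One row of the hypothetical "orders" table; DBInputFormat requires
      // the value class to implement both Writable and DBWritable.
      public static class OrderRecord implements Writable, DBWritable {
        long orderId;
        double amount;

        public void readFields(ResultSet rs) throws SQLException {
          orderId = rs.getLong("order_id");
          amount = rs.getDouble("amount");
        }
        public void write(PreparedStatement ps) throws SQLException {
          ps.setLong(1, orderId);
          ps.setDouble(2, amount);
        }
        public void readFields(DataInput in) throws IOException {
          orderId = in.readLong();
          amount = in.readDouble();
        }
        public void write(DataOutput out) throws IOException {
          out.writeLong(orderId);
          out.writeDouble(amount);
        }
      }

      public static class OrderMapper extends MapReduceBase
          implements Mapper<LongWritable, OrderRecord, LongWritable, Text> {
        public void map(LongWritable key, OrderRecord row,
            OutputCollector<LongWritable, Text> out, Reporter reporter)
            throws IOException {
          out.collect(new LongWritable(row.orderId),
              new Text(Double.toString(row.amount)));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OrdersJob.class);

        // Hypothetical JDBC driver, URL, and credentials.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/salesdb", "dbuser", "dbpass");

        // Read "orders" ordered by its key; this also sets DBInputFormat
        // as the job's input format.
        DBInputFormat.setInput(conf, OrderRecord.class, "orders",
            null /* WHERE conditions */, "order_id", "order_id", "amount");

        conf.setMapperClass(OrderMapper.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/orders-out"));

        // Keep the mapper count small - each map task opens its own
        // connection to the database.
        conf.setNumMapTasks(2);

        JobClient.runJob(conf);
      }
    }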

Re: Pig Output

2011-12-05 Thread Bejoy Ks
Hi Aaron, Instead of copyToLocal use getmerge. It would do your job. The syntax on the CLI is: hadoop fs -getmerge <hdfs source dir> /xyz.txt. Hope it helps!... Regards Bejoy.K.S On Tue, Dec 6, 2011 at 1:57 AM, Aaron Griffith wrote: > Using PigStorage() my Pig script output gets put into partial files on...
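If you'd rather do the merge from Java instead of the shell, FileUtil.copyMerge is the programmatic counterpart of getmerge. A small sketch, with hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeToLocal {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Concatenate every file under the HDFS source dir (e.g. the
        // part-* outputs of a Pig job) into one file on the local disk.
        FileUtil.copyMerge(hdfs, new Path("/user/aaron/pig-output"),
            local, new Path("/tmp/xyz.txt"),
            false /* don't delete the source */, conf, null /* no separator */);
      }
    }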

MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Chris Curtin
Hi, Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14, 8-node cluster, 64-bit CentOS. We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reduce tasks. When we investigate, it looks like the TaskTracker on the node being fetched from is not running. Looking at th...

Running a job continuously

2011-12-05 Thread burakkk
Hi everyone, I want to run an MR job continuously, because I have streaming data and I try to analyze it all the time with my own algorithm. For example, say you want to solve the wordcount problem - it's the simplest one :) If you have multiple files and new files keep coming, how do you handle it...

Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Todd Lipcon
Hi Chris, I'd suggest updating to a newer version of your Hadoop distro - you're hitting some bugs that were fixed last summer. In particular, you're missing the "amendment" patch from MAPREDUCE-2373 as well as some patches to MR which make the fetch retry behavior more aggressive. -Todd On Mon,...

Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Bejoy Ks
Hi Chris, From the stack trace, it looks like a JVM corruption issue. It is a known issue and has been fixed in CDH3u2; I believe an upgrade would solve your issues. https://issues.apache.org/jira/browse/MAPREDUCE-3184 Then, regarding your queries, I'd try to help you out a bit. In MapReduce...

Re: Running a job continuously

2011-12-05 Thread Bejoy Ks
Burak, if you have a continuous inflow of data, you can choose Flume to aggregate the files into larger sequence files or so if they are small, and when you have a substantial chunk of data (equal to the HDFS block size) you can push that data onto HDFS. Based on your SLAs, you need to schedule your...

Re: Running a job continuously

2011-12-05 Thread Mike Spreitzer
Burak, Before we can really answer your question, you need to give us some more information on the processing you want to do. Do you want output that is continuous or batched (if so, how)? How should the output at a given time be related to the input up to then and to the previous outputs? Regards...

Re: Pig Output

2011-12-05 Thread Russell Jurney
hadoop dfs -cat /my/path/* > single_file Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Dec 5, 2011, at 12:30 PM, Aaron Griffith wrote: > Using PigStorage() my Pig script output gets put into partial files on the > Hadoop file system. > > When I use the copyTo...

Re: Running a job continuously

2011-12-05 Thread John Conwell
You might also want to take a look at Storm, as that's what it's designed to do: https://github.com/nathanmarz/storm/wiki On Mon, Dec 5, 2011 at 1:34 PM, Mike Spreitzer wrote: > Burak, > Before we can really answer your question, you need to give us some more > information on the processing you want...

Re: Multiple Mappers for Multiple Tables

2011-12-05 Thread Justin Vincent
Thanks Bejoy, I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a Path parameter. Are these paths just ignored here? On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks wrote: > Hi Justin, > Just to add on to my response. If you need to fetch data from > RDBMS in your mapp...
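As far as I can tell from the 0.20-line code, with DBInputFormat the Path handed to MultipleInputs serves only as the key the delegating input format uses to pick the format/mapper pair; the splits themselves come from the DBConfiguration. A hedged sketch of that combination, reusing the hypothetical OrderRecord/OrderMapper from the earlier sketch, with IdentityMapper standing in on the file side:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.MultipleInputs;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;

    public class JoinJobSetup {
      // Sketch: combine a file input and a DB input in one job.
      public static JobConf configure(JobConf conf) {
        // Configure the DB side first. Note setInput() sets the job's
        // input format, which addInputPath() below replaces with its
        // delegating format.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/salesdb", "dbuser", "dbpass");
        DBInputFormat.setInput(conf, OrdersJob.OrderRecord.class, "orders",
            null, "order_id", "order_id", "amount");

        // File side of the join: a real HDFS path.
        MultipleInputs.addInputPath(conf, new Path("/data/customers"),
            TextInputFormat.class, IdentityMapper.class);

        // DB side: this path is just a routing key; DBInputFormat ignores
        // it and derives its splits from the DBConfiguration above.
        MultipleInputs.addInputPath(conf, new Path("/dummy/orders"),
            DBInputFormat.class, OrdersJob.OrderMapper.class);
        return conf;
      }
    }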

Re: Running a job continuously

2011-12-05 Thread burakkk
Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to execute the MR job with the same algorithm, but different files have different velocity. Both Storm and Facebook's Hadoop are designed for that, but I want to use the Apache distribution. Bejoy Ks, I have a continuous inflow of data...

Re: Running a job continuously

2011-12-05 Thread Abhishek Pratap Singh
Hi Burak, The model of Hadoop is very different; it is based on a job-oriented model - in simpler words, it's a kind of batch model where a MapReduce job is executed on a batch of data which is already present. As per your requirement, the word count example doesn't make sense if the file is being written co...

RE: Running a job continuously

2011-12-05 Thread Ravi teja ch n v
Hi Burak, > Bejoy Ks, I have a continuous inflow of data but I think I need a near real-time system. Just to add to Bejoy's point: with Oozie, you can specify a data dependency for running your job. When a specific amount of data is in, you can configure Oozie to run your job. I think this will...

Re: Availability of Job traces or logs

2011-12-05 Thread Amar Kamat
Arun, > I want to test its behaviour under different sizes of job traces (meaning > the number of jobs, say 5, 10, 25, 50, 100) under different > numbers of nodes. > Till now I was using only the test data given by Mumak, which has 19 jobs and > a 1529-node topology. I don't have many nodes > with me to run som...

Re: Automate Hadoop installation

2011-12-05 Thread alo alt
Hi, to deploy software I suggest pulp: https://fedorahosted.org/pulp/wiki/HowTo For a package-based distro (debian, redhat, centos) you can build apache's hadoop, pack it and delpoy. Configs, as Cos say, over puppet. If you use a redhat / centos take a look at spacewalk. best, Alex On Mon, De