RE: Storing Custom Java Objects in Hadoop Distributed Cache

2010-03-17 Thread Sanjay Sharma
Hi Ninad, You can always use Java object serialization to store custom objects as files in the Hadoop distributed cache before the map/reduce tasks start running. The rule-of-thumb steps for such usage are: a. Create the object while configuring your job, serialize it to a file, and put it in the distributed cache b.
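
A minimal sketch of the approach described above, against the old org.apache.hadoop.filecache.DistributedCache API; the class name, file paths, and the idea of a serializable "lookup" object are illustrative assumptions, not taken from the original mail.

    import java.io.*;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CacheSetupSketch {
      // Driver side: serialize the object, push it to HDFS, register it in the cache.
      public static void addToCache(Serializable obj, Configuration conf) throws Exception {
        File local = new File("/tmp/lookup.ser");                    // illustrative path
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(local));
        out.writeObject(obj);
        out.close();

        FileSystem fs = FileSystem.get(conf);
        Path hdfsPath = new Path("/cache/lookup.ser");               // illustrative path
        fs.copyFromLocalFile(new Path(local.getAbsolutePath()), hdfsPath);

        DistributedCache.addCacheFile(new URI(hdfsPath.toString()), conf);
      }

      // Task side (e.g. in the mapper's configure()): read the localized copy back.
      public static Object readFromCache(Configuration conf) throws Exception {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(cached[0].toString()));
        Object obj = in.readObject();
        in.close();
        return obj;
      }
    }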

Distributed hadoop setup 0 live datanode problem in cluster

2010-03-17 Thread William Kang
Hi, I just moved from a pseudo-distributed Hadoop setup to a four-machine fully distributed Hadoop setup. But after I start the DFS, no live node shows up. If I make the master a slave too, then the datanode on the master machine will show up. I looked through all the logs and found no errors. The only thing

Re: Distributed hadoop setup 0 live datanode problem in cluster

2010-03-17 Thread Jeff Zhang
Can you post your namenode's log? It seems that your datanode cannot connect to the namenode. On Wed, Mar 17, 2010 at 2:43 PM, William Kang weliam.cl...@gmail.com wrote: Hi, I just moved from a pseudo-distributed Hadoop setup to a four-machine fully distributed Hadoop setup. But, after I start

Re: Distributed hadoop setup 0 live datanode problem in cluster

2010-03-17 Thread William Kang
Hi Jeff, Here is the log from my namenode: / 2010-03-17 03:09:59,750 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode

Re: Distributed hadoop setup 0 live datanode problem in cluster

2010-03-17 Thread William Kang
Hi Jeff, I think I have partly found the reason for this problem. The /etc/hosts entry for 127.0.0.1 has the master's hostname in it, so the namenode took 127.0.0.1 as the IP address of the namenode. I fixed it and have already found two nodes. There is one still missing. I will let you guys know what
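
For reference, the shape of the fix William describes; the hostnames and addresses below are purely illustrative. The point is that the master's hostname must resolve to its real network address rather than to the 127.0.0.1 loopback line:

    # /etc/hosts on the master (illustrative addresses)
    127.0.0.1      localhost
    192.168.0.10   master      # real LAN address, not 127.0.0.1
    192.168.0.11   slave1
    192.168.0.12   slave2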

optimization help needed

2010-03-17 Thread Reik Schatz
Preparing a Hadoop presentation here. For demonstration I start up a 5-machine m1.large cluster in EC2 via the Cloudera scripts ($ hadoop-ec2 launch-cluster my-hadoop-cluster 5). Then I send a 500 MB XML file over into HDFS. The Mapper will receive an XML block as the key, select an email address

Fwd: Google Research: MapReduce: The programming model and practice

2010-03-17 Thread Edward J. Yoon
Just FWD. -- Forwarded message -- From: Edward J. Yoon edwardy...@apache.org Date: Wed, Mar 17, 2010 at 5:47 PM Subject: Google Research: MapReduce: The programming model and practice To: hama-...@incubator.apache.org FYI, http://research.google.com/pubs/pub36249.html -- Best

Sqoop Installation on Apache Hadoop 0.20.2

2010-03-17 Thread Utku Can Topçu
Dear All, I'm trying to run tests using MySQL as a kind of datasource, so I thought Cloudera's Sqoop would be a nice project to have in production. However, I'm not using Cloudera's Hadoop distribution right now, and actually I'm not thinking of switching from a main project to a

Re: Sqoop Installation on Apache Hadoop 0.20.2

2010-03-17 Thread Reik Schatz
At least for MRUnit, I was not able to find it outside of the Cloudera distribution (CDH). What I did: installed CDH locally using apt (Ubuntu), searched for and copied the mrunit library into my local Maven repository, and removed CDH afterwards. I guess the same is somehow possible for Sqoop.

Measuring running times

2010-03-17 Thread Antonio D'Ettole
Hi everybody, as part of my project work at school I'm running some Hadoop jobs on a cluster. I'd like to measure exactly how long each phase of the process takes: mapping, shuffling (ideally divided into copying and sorting) and reducing. The tasktracker logs do not seem to supply the start/end

Re: Storing Custom Java Objects in Hadoop Distributed Cache

2010-03-17 Thread Ninad Raut
These are good inputs Sanjay. Thanks for the help. On Wed, Mar 17, 2010 at 11:33 AM, Sanjay Sharma sanjay.sha...@impetus.co.in wrote: Hi Ninad, You can always use Java object serialization to store custom objects as files in Hadoop distributed cache before map/reducer start running. The

Re: optimization help needed

2010-03-17 Thread Gang Luo
Hi, you can control the number of reducers with JobConf.setNumReduceTasks(n). The number of mappers is determined by (file size) / (split size); by default the split size is 64 MB. Since your dataset is not very large, there should be no big difference if you change these. If you are only interested
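
A small sketch of the calls mentioned above, using the old mapred JobConf API; the driver class name and the particular counts are illustrative.

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf(ReducerCountSketch.class);
        conf.setNumReduceTasks(4);   // the reducer count set here is used as-is
        conf.setNumMapTasks(8);      // only a hint; the actual mapper count comes from
                                     // (total input size) / (split size, 64 MB by default)
      }
    }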

Re: optimization help needed

2010-03-17 Thread Reik Schatz
Very good input not to send the original XML over to the reducers. For JobConf.setNumReduceTasks(n), isn't that just a hint, with the real number determined by the Partitioner I use, which will be the default HashPartitioner? One other thought I had: what will happen if the values

Re: optimization help needed

2010-03-17 Thread Gang Luo
Hi Reik, the number of reducers is not a hint (the mapper count is a hint). The default hash partitioner hashes each record's key and assigns it to one of the reducers based on the reducer count. If the values list is too large to fit into heap memory, then you will get an exception and the job will fail
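
Loosely, what the stock HashPartitioner does (a sketch of the old mapred-API partitioner contract, not the verbatim Hadoop source): the reducer index is the key's hash modulo the number of reducers, so every value for a given key lands on the same reducer.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SketchHashPartitioner<K, V> implements Partitioner<K, V> {
      public void configure(JobConf job) { }    // nothing to configure
      public int getPartition(K key, V value, int numReduceTasks) {
        // The mask keeps the index non-negative; identical keys always map to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }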

Re: Slave data node failing to connect?

2010-03-17 Thread Kane, David
Folks, Does anyone know if this earlier post ever reached a resolution? I am trying to work through the same tutorial, and I have encountered the same issue. Of the candidate problems Jason suggested, none of them seem to pan out in my case (details below). I'm looking for suggestions as to

Re: optimization help needed

2010-03-17 Thread Reik Schatz
Thanks Gang, I will do some testing tomorrow - skip sending the whole XML, maybe add some Reducers - and see where I end up. Gang Luo wrote: Hi Reik, the number of reducers is not a hint (the mapper count is a hint). The default hash partitioner will hash and send records to each reducer in

Re: Trashbin is not recycled

2010-03-17 Thread Marcus Herou
Thanks. On Mon, Mar 15, 2010 at 10:25 AM, Rekha Joshi rekha...@yahoo-inc.com wrote: ..dfs -rmr -skipTrash /user/hadoop/.Trash recreates .Trash on consecutive rmr... -skipTrash can be used generally if you don't want a backup of deletes, here only to illustrate.. On 3/15/10 2:43 PM, Marcus

Re: Measuring running times

2010-03-17 Thread Simone Leo
At the default log level, Hadoop job logs (the ones you also get in the job's output directory under _logs/history) contain entries like the following: ReduceAttempt TASK_TYPE=REDUCE TASKID=tip_200809020551_0008_r_02 TASK_ATTEMPT_ID=task_200809020551_0008_r_02_0 START_TIME=1220331166789

Re: Measuring running times

2010-03-17 Thread Owen O'Malley
On Mar 17, 2010, at 4:47 AM, Antonio D'Ettole wrote: Hi everybody, as part of my project work at school I'm running some Hadoop jobs on a cluster. I'd like to measure exactly how long each phase of the process takes: mapping, shuffling (ideally divided in copying and sorting) and reducing.

Is there an easy way to clear old jobs from the jobtracker webpage?

2010-03-17 Thread Raymond Jennings III
I'd like to be able to clear the contents of the jobs that have completed running on the jobtracker webpage. Is there an easy way to do this without restarting the cluster?

Austin Hadoop Users Group - Tomorrow Evening (Thursday)

2010-03-17 Thread Stephen Watt
Hi Folks The Austin HUG is meeting tomorrow night. I hope to see you there. We have speakers from Rackspace (Stu Hood on Cassandra) and IBM (Gino Bustelo on BigSheets). Detailed Information is available at http://austinhug.blogspot.com/ Kind regards Steve Watt

when to send distributed cache file

2010-03-17 Thread Gang Luo
Hi all, I am wondering when Hadoop distributes the cache files. Is it the moment we call DistributedCache.addCacheFile()? Will the time to distribute the caches be counted as part of the MapReduce job time? Thanks, -Gang

Re: Austin Hadoop Users Group - Tomorrow Evening (Thursday)

2010-03-17 Thread Alexandre Jaquet
Hi, Please let me know if you will publish any kind of document, presentation, video, or anything else. Thanks in advance, Alexandre Jaquet 2010/3/17 Stephen Watt sw...@us.ibm.com Hi Folks The Austin HUG is meeting tomorrow night. I hope to see you there. We have speakers from Rackspace (Stu Hood on

Re: WritableName can't load class in hive

2010-03-17 Thread Arvind Prabhakar
[cross posting to hive-user] Oded - how did you create the table in Hive? Did you specify any row format SerDe for the table? If not, then that may be the cause of this problem since the default LazySimpleSerDe is unable to deserialize the custom Writable key value pairs that you have used in

Re: when to send distributed cache file

2010-03-17 Thread Gang Luo
Thanks Ravi. Here are some observations. I ran job1 to generate some data used by the following job2, without replication. The total size of the job1 output is 25 MB, spread across 50 files. I use the distributed cache to send all the files to the nodes running job2 tasks. When job2 starts, it stayed at map

RE: WritableName can't load class in hive

2010-03-17 Thread Oded Rotem
No, I didn't specify any SerDe. I'll read up on that and see if it works. Thanks. -Original Message- From: Arvind Prabhakar [mailto:arv...@cloudera.com] Sent: Wednesday, March 17, 2010 10:40 PM To: common-user@hadoop.apache.org; hive-u...@hadoop.apache.org Subject: Re: WritableName

Re: Measuring running times

2010-03-17 Thread Antonio D'Ettole
At the default log level, Hadoop job logs (the ones you also get in the job's output directory under _logs/history) Thanks Simone, that's exactly what I was looking for. Look at the job history logs. They break down the times for each task I understand you guys are talking about the same

Re: Is there a way to suppress the attempt logs?

2010-03-17 Thread Bill Graham
Not sure if what you're asking is possible or not, but you could experiment with these params to see if you could achieve a similar effect. <property> <name>mapred.userlog.limit.kb</name> <value>0</value> <description>The maximum size of user-logs of each task in KB. 0 disables the cap.</description>

Re: Is there a way to suppress the attempt logs?

2010-03-17 Thread Arun C Murthy
Moving to mapreduce-user@. On Mar 15, 2010, at 5:54 PM, abhishek sharma wrote: Hi all, Hadoop creates a directory (and some files) for each map and reduce task attempt in logs/userlogs on each tasktracker. Is there a way to configure Hadoop not to create these attempt logs? Not really,

Re: Sqoop Installation on Apache Hadoop 0.20.2

2010-03-17 Thread Aaron Kimball
Hi Utku, Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of DataDrivenDBInputFormat (among other APIs), which is not shipped with Apache's 0.20 release. In order to get Sqoop working on 0.20, you'd need to apply a lengthy list of patches from the project source repository to your

Re: hadoop under cygwin issue

2010-03-17 Thread Brian Wolf
Alex Kozlov wrote: Hi Brian, Is your namenode running? Try 'hadoop fs -ls /'. Alex On Mar 12, 2010, at 5:20 PM, Brian Wolf brw...@gmail.com wrote: Hi Alex, I am back on this problem. Seems it works, but I have this issue with connecting to server. I can connect 'ssh localhost' ok.