Frequent Namespace ID exceptions

2010-03-07 Thread bharath v
Hi all, I am getting frequent NamespaceID exceptions. I am running Hadoop 0.20.0 on a cluster of 8 machines. The datanodes work properly for some time and then stop automatically, logging a NamespaceID exception. I am manually deleting the namespace data, formatting the namenode and restart
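A common workaround for mismatched namespaceIDs (rather than wiping everything and reformatting) is to make each datanode's stored namespaceID match the namenode's. This is only a sketch; the paths below are placeholders assuming the default 0.20 storage layout under dfs.name.dir and dfs.data.dir:

    # On the namenode: note the current namespaceID
    grep namespaceID /path/to/dfs/name/current/VERSION

    # On each failing datanode: stop the datanode, then either edit the
    # namespaceID line in its VERSION file to match the namenode's value ...
    bin/hadoop-daemon.sh stop datanode
    vi /path/to/dfs/data/current/VERSION

    # ... or, if the block replicas on this node are expendable, clear the
    # data directory so the datanode re-registers cleanly on restart.
    rm -rf /path/to/dfs/data/*
    bin/hadoop-daemon.sh start datanode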

Re: Security issue: hadoop fs shell bypass authentication?

2010-03-07 Thread Allen Wittenauer
On 3/6/10 10:41 PM, "jiang licht" wrote: > I can feel that pain, Kerberos needs you to pull more hair from your head :) I > worked on it a while back and now only remember a bit of it. The only other real choice is PKI. CRLs? Blech. I'd much rather tie the grid into my pre-existing Active Dir

Re: Namenode problem

2010-03-07 Thread Eason.Lee
2010/3/8 William Kang > Hi guys, > Thanks for your replies. I did not put anything in /tmp. It's just that > default setting of dfs.name.dir/dfs.data.dir is set to the subdir in /tmp every time when I restart the hadoop, the localhost:50070 does not show up. > The localhost:50030 is fine. Unles

Re: Namenode problem

2010-03-07 Thread William Kang
Hi guys, Thanks for your replies. I did not put anything in /tmp. It's just that every time I restart Hadoop, localhost:50070 does not show up. localhost:50030 is fine. Unless I reformat the namenode, I won't be able to see the HDFS web page at 50070. It did not clean /tmp automatically

RE: Namenode problem

2010-03-07 Thread sagar_shukla
Hi William, Can you provide a snapshot of the log file log/hadoop-hadoop-namenode.log when the service fails to start on reboot of the machine? Also, what does your configuration look like? Thanks, Sagar -Original Message- From: William Kang [mailto:weliam.cl...@gmail.com] Sent: Mon

Re: Namenode problem

2010-03-07 Thread Bradford Stephens
Yeah. Don't put things in /tmp. That's unpleasant in the long run. On Sun, Mar 7, 2010 at 9:36 PM, Eason.Lee wrote: > Your /tmp directory is cleaned automatically? > > Try to set dfs.name.dir/dfs.data.dir to a safe dir~~ > > 2010/3/8 William Kang > >> Hi all, >> I am running HDFS in Pseudo-distrib

Re: Namenode problem

2010-03-07 Thread Eason.Lee
Your /tmp directory is cleaned automatically? Try to set dfs.name.dir/dfs.data.dir to a safe dir~~ 2010/3/8 William Kang > Hi all, > I am running HDFS in Pseudo-distributed mode. Every time after I restart > the machine, I have to format the namenode, otherwise localhost:50070 > won't show up
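A minimal hdfs-site.xml sketch of that suggestion; the directory values are examples only, the point being that they live somewhere a reboot will not wipe:

    <!-- hdfs-site.xml: example values only -->
    <property>
      <name>dfs.name.dir</name>
      <value>/var/hadoop/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/var/hadoop/dfs/data</value>
    </property>

By default both directories live under hadoop.tmp.dir, which itself defaults to a directory under /tmp, so moving hadoop.tmp.dir (in core-site.xml) is an alternative. After pointing dfs.name.dir at the new location, format the namenode one last time (or copy the existing current/ directory over); subsequent reboots should then keep the filesystem image.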

Namenode problem

2010-03-07 Thread William Kang
Hi all, I am running HDFS in Pseudo-distributed mode. Every time after I restart the machine, I have to format the namenode, otherwise localhost:50070 won't show up. It is quite annoying since all the data is lost. Does anybody know why this happens? And how should I fix this problem

Re: Shuffle In Memory OutOfMemoryError

2010-03-07 Thread Ted Yu
Lowering mapred.job.shuffle.input.buffer.percent would be the option to choose. Maybe GC wasn't releasing memory fast enough for in-memory shuffling. On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins wrote: > > Thanks Ted. Very helpful. You are correct that I misunderstood the code > at ReduceTask
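A mapred-site.xml sketch of that tuning; the values are illustrative rather than recommendations, and the property names are the 0.20 ones discussed in this thread:

    <!-- mapred-site.xml: illustrative values only -->
    <property>
      <name>mapred.job.shuffle.input.buffer.percent</name>
      <value>0.50</value>  <!-- default 0.70: fraction of reduce heap used to buffer map output -->
    </property>
    <property>
      <name>mapred.reduce.parallel.copies</name>
      <value>5</value>  <!-- default 5: more copiers means more concurrent in-memory segments -->
    </property>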

RE: Copying files between two remote hadoop clusters

2010-03-07 Thread zhuweimin
Hi, the HDFS shell commands support standard I/O; you can use it to avoid saving a temporary copy on the local file system. For example: wget https://web-server/file-path -O - | hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Refer to this URL: http://hadoop.apache.org/common/docs/curren

RE: Shuffle In Memory OutOfMemoryError

2010-03-07 Thread Andy Sautins
Thanks Ted. Very helpful. You are correct that I misunderstood the code at ReduceTask.java:1535. I missed the fact that it's in an IOException catch block. My mistake. That's what I get for being in a rush. For what it's worth, I did re-run the job with mapred.reduce.parallel.copies set

Re: Shuffle In Memory OutOfMemoryError

2010-03-07 Thread Ted Yu
My observation is based on this call chain: MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling ramManager.canFitInMemory(decompressedLength) Basically ramManager.canFitInMemory() makes its decision without considering the number of MapOutputCopiers that are running. Thus 1.25 *
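To make the arithmetic concrete with the 0.20 defaults discussed in this thread (figures are approximate): with a 1 GB reduce heap and mapred.job.shuffle.input.buffer.percent = 0.70, the in-memory shuffle buffer is about 700 MB; each copier may accept a segment of up to 25% of that (about 175 MB), so the default 5 parallel copiers can together hold roughly 875 MB, i.e. 1.25 times the buffer, which is how the shuffle can run out of memory even though every individual copy passed canFitInMemory().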

RE: Shuffle In Memory OutOfMemoryError

2010-03-07 Thread Andy Sautins
Ted, I'm trying to follow the logic in your mail and I'm not sure I'm following it. If you wouldn't mind helping me understand, I would appreciate it. Looking at the code, maxSingleShuffleLimit is only used in determining whether the copy _can_ fit into memory: boolean canFitInMemory(long

Re: Shuffle In Memory OutOfMemoryError

2010-03-07 Thread Jacob R Rideout
Ted, Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have some time to write a patch later this week. Jacob Rideout On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote: > I think there is a mismatch (in ReduceTask.java) between: > this.numCopiers = conf.getInt("mapred.reduce.parall

Re: Copying files between two remote hadoop clusters

2010-03-07 Thread zenMonkey
distcp seems to copy between clusters: http://hadoop.apache.org/common/docs/current/distcp.html zenMonkey wrote: > > I want to write a script that pulls data (flat files) from a remote > machine and pushes that into its hadoop cluster
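For reference, the basic inter-cluster forms from the distcp guide look like this; the namenode hosts, ports and paths are placeholders:

    # Copy /foo/bar from the cluster whose namenode is nn1 to /bar/foo on nn2's cluster
    hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

    # Between clusters running different Hadoop versions, read the source over HFTP
    # (50070 is the default dfs.http.address port) and run the copy on the destination cluster
    hadoop distcp hftp://nn1:50070/foo/bar hdfs://nn2:8020/bar/foo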

Re: Parallelizing HTTP calls with Hadoop

2010-03-07 Thread Mark Kerzner
Phil, what you are describing is close to what Nutch is already doing. You can look at it - all this coding is non-trivial, and you can save yourself a lot of work and debugging. Mark On Sun, Mar 7, 2010 at 8:30 AM, Zak Stone wrote: > Hi Phil, > > If you treat each HTTP request as a Hadoop tas

Re: Parallelizing HTTP calls with Hadoop

2010-03-07 Thread Zak Stone
Hi Phil, If you treat each HTTP request as a Hadoop task and the individual HTTP responses are small, you may find that the latency of the web service leaves most of your Hadoop processes idle most of the time. To avoid this problem, you can let each mapper make many HTTP requests in parallel, ei
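One way to sketch that idea with the 0.20 mapreduce API: feed each map task a batch of URLs (one per input line) and keep a small thread pool of fetches in flight so the task is not idle while waiting on the remote service. This is an illustration, not code from the thread; the class name, pool size and limits are arbitrary:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.concurrent.*;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /**
     * Illustrative mapper: each input line is a URL; a small thread pool keeps
     * several HTTP requests in flight at once and results are emitted as they
     * complete. Pool size and the pending-request cap are example values.
     */
    public class ParallelFetchMapper extends Mapper<LongWritable, Text, Text, Text> {

      private ExecutorService pool;
      private CompletionService<String[]> completion;
      private int pending = 0;

      @Override
      protected void setup(Context context) {
        pool = Executors.newFixedThreadPool(10);   // 10 concurrent fetches per task
        completion = new ExecutorCompletionService<String[]>(pool);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        final String url = value.toString().trim();
        completion.submit(new Callable<String[]>() {
          public String[] call() throws Exception {
            // Fetch the URL and return (url, response body)
            InputStream in = new URL(url).openStream();
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            try {
              byte[] buf = new byte[8192];
              int n;
              while ((n = in.read(buf)) != -1) {
                body.write(buf, 0, n);
              }
            } finally {
              in.close();
            }
            return new String[] { url, body.toString("UTF-8") };
          }
        });
        pending++;
        drain(context, 20);   // cap the number of in-flight/queued requests
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        drain(context, 0);    // flush everything still in flight
        pool.shutdown();
      }

      /** Emit completed responses until at most 'max' requests remain pending. */
      private void drain(Context context, int max)
          throws IOException, InterruptedException {
        while (pending > max) {
          try {
            String[] result = completion.take().get();
            context.write(new Text(result[0]), new Text(result[1]));
          } catch (ExecutionException e) {
            context.getCounter("fetch", "failed").increment(1);
          }
          pending--;
        }
      }
    }

The older mapred API also ships org.apache.hadoop.mapred.lib.MultithreadedMapRunner, which wraps an ordinary single-threaded mapper in a thread pool and may be simpler if each record can be fetched independently.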

Error running Hadoop Job

2010-03-07 Thread Varun Thacker
I have compiled the program without errors. This is what my .jar file looks like: Its name is Election.jar Directory: had...@varun:~/hadoop-0.20.1 Inside the jar these are the files: Election.class Election$Reduce.class Election$Map.class META-INF/ Manifest-Version: 1.0 Created-By: 1.6.0_0 (Sun Mic
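Since the manifest shown has no Main-Class entry, the driver class has to be named on the command line when submitting. A minimal invocation sketch, assuming Election is the driver class and that it takes HDFS input and output paths (the paths here are placeholders):

    # run from the hadoop-0.20.1 directory
    bin/hadoop jar Election.jar Election /user/varun/input /user/varun/output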

Re: Parallelizing HTTP calls with Hadoop

2010-03-07 Thread prasenjit mukherjee
Thanks to Mridul, here is an approach suggested by him based on Pig, which works fine for me: input_lines = load 'my_s3_list_file' as (location_line:chararray); grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED; actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group); I

Parallelizing HTTP calls with Hadoop

2010-03-07 Thread Phil McCarthy
Hi, I'm new to Hadoop, and I'm trying to figure out the best way to use it to parallelize a large number of calls to a web API, and then process and store the results. The calls will be regular HTTP requests, and the URLs follow a known format, so they can be generated easily. I'd like to understand h

I got java crash error when writing SequenceFile

2010-03-07 Thread forbbs forbbs
I just ran a simple program, and got the error below: # # A fatal error has been detected by the Java Runtime Environment: # # SIGFPE (0x8) at pc=0x0030bda07927, pid=22891, tid=1076017504 # # JRE version: 6.0_18-b07 # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode linux-amd
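The program itself is not shown in the message. For comparison, a minimal 0.20-style SequenceFile writer looks roughly like the sketch below (the output path and key/value types are arbitrary examples); if code along these lines still dies with a SIGFPE inside the JVM, the problem is more likely in the JVM or OS than in the Hadoop API usage.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    /** Minimal SequenceFile write using the 0.20 API; path and record types are examples. */
    public class SeqFileWrite {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/example.seq");

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, IntWritable.class, Text.class);
        try {
          for (int i = 0; i < 100; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
          }
        } finally {
          writer.close();
        }
      }
    }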