Re: how to design the mapper and reducer for the below problem

2013-06-14 Thread Azuryy Yu
This is a graph problem: you want to find all connected subgraphs, so I don't think it's easy using Map/Reduce. But you can try YARN; it can be iterated easily, at least compared with M/R. On Fri, Jun 14, 2013 at 12:41 PM, parnab kumar wrote: > Consider an input file of the following format : > inpu

Re: how to design the mapper and reducer for the below problem

2013-06-14 Thread Bhasker Allene
That looks like a graph algorithm using MapReduce. Sorry, I couldn't give you a specific answer! On 14/06/2013 05:41, parnab kumar wrote: Consider an input file of the following format : input File : 1 2 2 3 3 4 6 7 7 9 10 11 The output should be as follows : 1 2 3 4 6 7 9 10 11

Re: how to design the mapper and reducer for the below problem

2013-06-14 Thread Harsh J
Hey Parnab, Please check out Giraph (http://giraph.apache.org), which should help you develop a program to solve this. On Fri, Jun 14, 2013 at 10:11 AM, parnab kumar wrote: > Consider an input file of the following format : > input File : > 1 2 > 2 3 > 3 4 > 6 7 > 7 9 > 10 11 > > The output should be
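
For readers who want to try it in plain MapReduce anyway, below is a minimal sketch of one round of min-label propagation over the edge list above. It is not the full algorithm: a complete solution would also have to carry the adjacency information from round to round and re-run until the labels stop changing, which is exactly the iteration overhead that makes Giraph (or an iterative YARN application) the more comfortable fit. Class name and path handling are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // One round of "adopt the smallest id among yourself and your neighbours" over
    // an edge list with one "u v" pair per line. Repeating this to a fixed point
    // (which also means carrying the edge list along between rounds, not shown here)
    // labels every connected subgraph by its smallest node id.
    public class MinLabelRound {

      public static class EdgeMapper
          extends Mapper<LongWritable, Text, LongWritable, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().trim().split("\\s+");
          long u = Long.parseLong(parts[0]);
          long v = Long.parseLong(parts[1]);
          // Emit both directions so each endpoint sees the other as a neighbour.
          ctx.write(new LongWritable(u), new LongWritable(v));
          ctx.write(new LongWritable(v), new LongWritable(u));
        }
      }

      public static class MinLabelReducer
          extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
        @Override
        protected void reduce(LongWritable node, Iterable<LongWritable> neighbours, Context ctx)
            throws IOException, InterruptedException {
          long label = node.get();
          for (LongWritable n : neighbours) {
            label = Math.min(label, n.get());
          }
          ctx.write(node, new LongWritable(label));  // node -> smallest id seen so far
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "min-label-round");
        job.setJarByClass(MinLabelRound.class);
        job.setMapperClass(EdgeMapper.class);
        job.setReducerClass(MinLabelReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }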

Re: Assigning the same partition number to the mapper output

2013-06-14 Thread Rahul Bhattacharjee
There is some flexibility when it comes to changing the name of the output; check out MultipleOutputs. I have never used it with a map-only job, though. Thanks, Rahul On Thu, Jun 13, 2013 at 8:33 AM, Maysam Yabandeh wrote: > Hi, > > I was wondering if it is possible in Hadoop to assign the same partition > nu
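
For reference, a small sketch of how MultipleOutputs can rename the output of a map-only job (new API); the base name "custom-name" and the key/value types are placeholders, and deriving the name from the input split's partition number (presumably what the original question is after) is left out.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Map-only mapper that writes through MultipleOutputs so the output files
    // get a chosen base name instead of the default "part-m-NNNNN".
    public class RenamingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

      private MultipleOutputs<NullWritable, Text> mos;
      private String baseName;

      @Override
      protected void setup(Context ctx) {
        mos = new MultipleOutputs<NullWritable, Text>(ctx);
        // Placeholder: derive this from the input file / split instead.
        baseName = "custom-name";
      }

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // The third argument is the base output path/name of the file this record goes to.
        mos.write(NullWritable.get(), value, baseName);
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        mos.close();
      }
    }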

Migration needed when updating within an Hadoop release

2013-06-14 Thread Schad, Bjoern-Bernhard (EXT-Redknee - DE/Berlin)
Hi, has it ever happened that a migration of persistent data was needed (or automatically executed) when updating a Hadoop installation within a release? If so, where could I find information regarding such a migration? I would be interested because the runtime of such a migration would

Re: Migration needed when updating within an Hadoop release

2013-06-14 Thread Alexander Alten-Lorenz
Hi Björn, > has it ever happened that a migration of persistent data was needed (or > automatically executed) when updating a Hadoop installation within a release? > If so, where could I find information regarding such a migration? Normally, when you change the minor release, you need

Re: Migration needed when updating within an Hadoop release

2013-06-14 Thread Alexander Alten-Lorenz
Excuse the typo, it should be: Normally, when you change the >major< release, you need to upgrade HDFS (http://hadoop.apache.org/docs/stable/hdfs_user_guide.html#Upgrade+and+Rollback). This will happen when you switch major branches. On Jun 14, 2013, at 12:10 PM, Alexander Alten-Lorenz wrote:

Re: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Rahul Bhattacharjee
Thanks Mayank. Any clue why only one disk was getting all the writes? Rahul On Thu, Jun 13, 2013 at 11:47 AM, Mayank wrote: > So we did a manual rebalance (followed the instructions at: > http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3

Re: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Mayank
No, as of this moment we have no idea about the reasons for that behavior. On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <rahul.rec@gmail.com> wrote: > Thanks Mayank. Any clue why only one disk was getting all the writes? > > Rahul > > > On Thu, Jun 13, 2013 at 11:47 AM, Mayank wr

RE: Migration needed when updating within an Hadoop release

2013-06-14 Thread Schad, Bjoern-Bernhard (EXT-Redknee - DE/Berlin)
Hello Alexander, thanks for your reply. This is very interesting for me indeed. - But what about minor updates, e.g. from 1.0.1 to 1.0.4? Has this ever happened for such updates? Also, I have a similar question regarding HBase. I understand that HBase has its own data model on top of/ withi

RE: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Sandeep L
Rahul, In general this issue sometimes happens in Hadoop; there is no exact reason for it. To mitigate it you need to run the balancer at regular intervals. Thanks, Sandeep. Date: Fri, 14 Jun 2013 16:39:02 +0530 Subject: Re: Application errors with one disk on datanode getting filled up to 100% Fr

Re: Migration needed when updating within an Hadoop release

2013-06-14 Thread Alexander Alten-Lorenz
Hi Björn, > - But what about minor updates, e.g. from 1.0.1 to 1.0.4? Has this ever > happened for such updates? You will probably see log messages like 'RPC version mismatch'; in this case you have to upgrade the filesystem. If not - all is well :) > - What about HBase minor releases in this co

Re: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Rahul Bhattacharjee
Thanks Sandeep. I was thinking that the overall HDFS cluster might get unbalanced over time and the balancer might be useful in that case. I was more interested to know why only one disk out of the 4 configured disks of the DN is getting all the writes. As per what I have read, writes should be in

Re: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Rahul Bhattacharjee
I wasn't aware of the datanode-level balancing procedure; I was thinking about the HDFS balancer. http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F Thanks, Rahul On Fri, Jun 14, 2013 at 5:50 PM, Rahul Bhattacharjee <rahul.rec@gmail.co

RE: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Sandeep L
Rahul, In general, most of the time Hadoop tries to compute data locally; that is, if you run a MapReduce task on a particular input, Hadoop will try to compute and write the data locally (the majority of the time this will happen) and replicate to other nodes. In your scenario the majority of your input data m

Re: Application errors with one disk on datanode getting filled up to 100%

2013-06-14 Thread Rahul Bhattacharjee
Thanks Sandeep. Yes, that's correct; I was more interested to know about the uneven distribution within the DN. Thanks, Rahul On Fri, Jun 14, 2013 at 6:12 PM, Sandeep L wrote: > Rahul, > > In general, most of the time Hadoop tries to compute data locally; that is, > if you run a MapReduce task on p

How to design the mapper and reducer for the following problem

2013-06-14 Thread parnab kumar
An input file where each line corresponds to a document. Each document is identified by some fingerprints. For example, a line in the input file is of the following form : input: - DOCID1 HASH1 HASH2 HASH3 HASH4 DOCID2 HASH5 HASH3 HASH1 HASH4 The output of the MapReduce job

Re: How to design the mapper and reducer for the following problem

2013-06-14 Thread Sanjay Subramanian
Hi, my quick-and-dirty, non-optimized solution would be as follows. MAPPER === OUTPUT from Mapper REDUCER === Iterate over keys; for a key = (say) {HASH1,HASH2,HASH3,HASH4}, format the collection of values with some StringBuilder-like class and output KEY = {DOCID1
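
Along the lines of that outline, a rough, non-optimized sketch is below. It assumes one document per line ("DOCID HASH1 HASH2 ...") and, as in the outline, uses the document's whole fingerprint set as the key so that documents with identical sets land in the same reduce call; the canonical sorting of hashes is my addition, and the job driver is omitted.

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FingerprintGrouping {

      // Key = the document's fingerprint set (sorted so equal sets produce equal keys),
      // value = the document id.
      public static class FpMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] tok = line.toString().trim().split("\\s+");
          String docId = tok[0];
          String[] hashes = Arrays.copyOfRange(tok, 1, tok.length);
          Arrays.sort(hashes);  // canonical order, so identical sets collide
          StringBuilder keyBuf = new StringBuilder();
          for (String h : hashes) {
            if (keyBuf.length() > 0) keyBuf.append(' ');
            keyBuf.append(h);
          }
          ctx.write(new Text(keyBuf.toString()), new Text(docId));
        }
      }

      // All doc ids sharing the same fingerprint set arrive in one reduce call.
      public static class FpReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text fingerprints, Iterable<Text> docIds, Context ctx)
            throws IOException, InterruptedException {
          StringBuilder docs = new StringBuilder();
          for (Text id : docIds) {
            if (docs.length() > 0) docs.append(' ');
            docs.append(id.toString());
          }
          ctx.write(new Text(docs.toString()), fingerprints);
        }
      }
    }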

Many Errors at the last step of copying files from _temporary to Output Directory

2013-06-14 Thread Sanjay Subramanian
Hi, my environment is like this. INPUT FILES == 400 gzip files, one from each server - average size gzipped 25MB REDUCER === Uses MultipleOutputs OUTPUT (Snappy) === /path/to/output/dir1 /path/to/output/dir2 /path/to/output/dir3 /path/to/output/dir4 Number of output directories
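
To make the setup concrete, a driver-side sketch of that kind of configuration (Snappy-compressed output plus MultipleOutputs named outputs) is below; paths, names and types are placeholders, not the actual job.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class SnappyMultiOutputDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "snappy-multi-output");
        job.setJarByClass(SnappyMultiOutputDriver.class);

        FileInputFormat.addInputPath(job, new Path("/path/to/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/path/to/output")); // placeholder

        // Compress every output file with Snappy.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // One named output per logical destination; in the reducer the files are
        // routed with mos.write(namedOutput, key, value, "dir1/part") etc., which
        // places them under subdirectories of the job output directory.
        MultipleOutputs.addNamedOutput(job, "dir1", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "dir2", TextOutputFormat.class, Text.class, Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }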

webhdfs read error after successful pig job

2013-06-14 Thread Adam Silberstein
Hi, I'm having some trouble with a WebHDFS read after running a Pig job that completed successfully. Here are some details: - I am using Hadoop CDH-4.1.3 and the compatible Pig that goes with it (0.10.0, I think). - The Pig job writes out about 10 files. I'm programmatically attempting to read e

Re: webhdfs read error after successful pig job

2013-06-14 Thread Ed Serrano
You might want to investigate whether your issue is always on the same node. On Fri, Jun 14, 2013 at 11:43 AM, Adam Silberstein wrote: > Hi, > I'm having some trouble with a WebHDFS read after running a Pig job that > completed successfully. > > Here are some details: > > - I am using Hadoop CDH-4.1.3 an
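
For reference, the read in question boils down to a single WebHDFS OPEN call like the sketch below (host, port, path and user are placeholders). The NameNode answers OPEN with a redirect to one of the DataNodes, which is why checking whether the failures always land on the same node, as suggested above, is a sensible first step.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsOpen {
      public static void main(String[] args) throws Exception {
        // Placeholders: NameNode host/port, file path, user name.
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1"
            + "/user/adam/output/part-r-00000?op=OPEN&user.name=adam");

        // OPEN returns a 307 redirect to a DataNode; HttpURLConnection follows it,
        // so the data is actually streamed from whichever DataNode serves the block.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        System.out.println("HTTP " + conn.getResponseCode());

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
        conn.disconnect();
      }
    }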

Re: HDFS to a different location other than HADOOP HOME

2013-06-14 Thread Raj Hadoop
I have modified dfs.data.dir from the default value to another value which is outside 'HADOOP_HOME': /SD1/hadoop_data. dfs.data.dir = /SD1/hadoop_data (where DataNodes store their blocks) My datanode is not starting after the above change. Can anyone tell me what the issue is?

Re: HDFS to a different location other than HADOOP HOME

2013-06-14 Thread Mohammad Tariq
Change the permissions of /SD1/hadoop_data to 755 and restart the process. Warm Regards, Tariq cloudfront.blogspot.com On Fri, Jun 14, 2013 at 11:10 PM, Raj Hadoop wrote: > I have modified dfs.data.dir from the default value to another value which > is outside 'HADOOP_HOME': /SD1/had

Re: HDFS to a different location other than HADOOP HOME

2013-06-14 Thread Raj Hadoop
But Tariq - /SD1/hadoop_data is in a separate group. My Hadoop is in group 'grp1'; the filesystem /SD1 is under group 'grp2'. As I have space under SD1, I want to use SD1 for Hadoop. So will setting permissions like '755' on /SD1/hadoop_data work? From: Mo

RE: container allocation

2013-06-14 Thread John Lilley
Thanks, that is good to know. Is there any way to say "please fail if I don't get the node I want"? Do I just release the container and try again? I'd like to understand the implications of this policy. Suppose I have 1000 data splits and a cluster capacity of 100 containers. If I try to schedul

RE: Assignment of data splits to mappers

2013-06-14 Thread John Lilley
Bertrand, Thanks for taking the time to explain this! I understand your point about contiguous blocks; they just aren't likely to exist. I am still curious about two things: 1) The map-per-block strategy. If we have a lot more blocks than containers, wouldn't there be some advantage to ha
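
On point 1, one easy way to experiment with maps that span several blocks is the split-size knobs on FileInputFormat, sketched below with illustrative numbers; the trade-off is that such a map usually reads the blocks beyond its first one from remote datanodes.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizing {
      // Ask for ~512 MB splits, i.e. several 128 MB blocks handled by one map task.
      // Sizes are illustrative; the resulting split size is
      // max(minSize, min(maxSize, blockSize)), so raising minSize above the block
      // size is what actually merges blocks into one split.
      public static void widenSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
      }
    }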

Re: container allocation

2013-06-14 Thread Sandy Ryza
Hi John, At this time, releasing containers is the preferred way to be strict about your locality requirements. This is not included in a release yet, but https://issues.apache.org/jira/browse/YARN-392 allows expressing hard locality constraints on requests, so you can tell the scheduler to never
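
A rough sketch of the release-and-retry pattern described above, written against AMRMClientAsync; the host name, memory size and callback wiring are placeholders, and the YARN-392 relaxLocality flag mentioned as unreleased is not shown.

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

    // Inside an application master: ask for a container on one specific host and
    // give back anything the scheduler hands out elsewhere, then ask again.
    public class StrictLocalityHelper {

      private final AMRMClientAsync<ContainerRequest> amRmClient;
      private final String wantedHost = "datanode17.example.com";  // placeholder

      public StrictLocalityHelper(AMRMClientAsync<ContainerRequest> amRmClient) {
        this.amRmClient = amRmClient;
      }

      public void requestOnWantedHost() {
        Resource capability = Resource.newInstance(1024, 1);  // 1 GB, 1 vcore
        Priority priority = Priority.newInstance(0);
        // nodes = just the host we want; the scheduler may still relax to rack/any.
        amRmClient.addContainerRequest(
            new ContainerRequest(capability, new String[] { wantedHost }, null, priority));
      }

      // Call this from the AMRMClientAsync callback's onContainersAllocated().
      public void onContainersAllocated(List<Container> containers) {
        for (Container c : containers) {
          if (!wantedHost.equals(c.getNodeId().getHost())) {
            // Not the node we wanted: hand it back and re-issue the request.
            amRmClient.releaseAssignedContainer(c.getId());
            requestOnWantedHost();
          } else {
            // launch work on this container via NMClient ...
          }
        }
      }
    }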

Different container sizes for MR tasks, job does not complete

2013-06-14 Thread Yuzhang Han
Hi, I am using MapReduce on YARN. I want to make tasks of the same job run in containers with different sizes. For example: Job1 = {Task1, ..., Task8}; Task1 := 1280 MB, Task2 to Task8 := 1024 MB. To achieve this, I manually call reqEvent.getCapability().setMemory(MEMORY_SIZE) in RMContainerAllocator.java wit

Question about Skip Bad Records

2013-06-14 Thread ????
Hi, I found that the SkippingRecordReader is no longer supported in the new API and I am curious about the reason; can anyone tell me? Besides, when I look into the old API and try to figure out what skip mode was doing, I am a little confused by the logic there. In my understanding, if the Java API
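
For context, skip mode in the old (org.apache.hadoop.mapred) API - the one the SkippingRecordReader belongs to - is switched on roughly like the sketch below; the numbers are arbitrary examples.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkipModeConfig {
      public static void enableSkipping(JobConf conf) {
        // Enter skip mode after 2 failed attempts of the same task (the default).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Accept at most 1 skipped record around each bad map record; 0 turns the
        // feature off, Long.MAX_VALUE accepts whatever range is found without
        // trying to narrow it down further.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
        // Same idea for reduce-side failures, counted in input groups.
        SkipBadRecords.setReducerMaxSkipGroups(conf, 1);
      }
    }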