Re: Reg: parsing all files file append

2012-09-10 Thread Bejoy Ks
Hi Manoj, From my limited knowledge of file appends in HDFS, I have seen more recommendations to use sync() in the latest releases than to use append(). Let us wait for some committer to comment authoritatively on the production readiness of append(). :) Regards, Bejoy KS On Mon, Sep 10, 2012
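
A minimal sketch of the two write paths being compared, using the 0.20/1.x-era API of this thread (the file path is illustrative, and in Hadoop 2.x sync() is superseded by hflush()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncVsAppend {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/example.log"); // illustrative path

            // sync(): keep a single writer's stream open and flush durably as you go.
            FSDataOutputStream out = fs.create(p);
            out.writeBytes("record 1\n");
            out.sync(); // renamed hflush() in Hadoop 2.x
            out.writeBytes("record 2\n");
            out.close();

            // append(): reopen a closed file. On clusters of this era it needs
            // dfs.support.append=true and is the part whose production
            // readiness was in question.
            FSDataOutputStream again = fs.append(p);
            again.writeBytes("record 3\n");
            again.close();
        }
    }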

Re: Reg: parsing all files file append

2012-09-10 Thread Manoj Babu
Thank you, Bejoy. Cheers! Manoj. On Mon, Sep 10, 2012 at 1:36 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Manoj, From my limited knowledge of file appends in HDFS, I have seen more recommendations to use sync() in the latest releases than to use append(). Let us wait for some committer to

Re: Counters across all jobs

2012-09-10 Thread Robin Verlangen
Hi Subbu, You're probably looking for something called distributed counters. Take a look at this question on Stack Overflow: http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation Best regards, Robin Verlangen, Software engineer, W http://www.robinverlangen.nl E

Re: Counters across all jobs

2012-09-10 Thread Vinod Kumar Vavilapalli
Counters are per-job in Hadoop MapReduce. You need an external aggregator for such cross-job counters - e.g., a node in ZooKeeper. Also, is it just for display, or does your job logic depend on this? If it is the former, and if you don't have a problem with waiting till jobs finish, you can do
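
A hedged sketch of that external-aggregator idea using a ZooKeeper znode (the znode path is invented; real code needs connection management, watchers, and error handling):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class CrossJobCounter {
        private final ZooKeeper zk;
        private final String path = "/records-processed"; // hypothetical znode

        public CrossJobCounter(ZooKeeper zk) throws Exception {
            this.zk = zk;
            if (zk.exists(path, false) == null) {
                zk.create(path, "0".getBytes("UTF-8"),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
        }

        // Each job adds its own counter total after it finishes; the versioned
        // setData acts as a compare-and-swap, so concurrent jobs retry safely.
        public void add(long delta) throws Exception {
            while (true) {
                Stat stat = new Stat();
                long current = Long.parseLong(
                        new String(zk.getData(path, false, stat), "UTF-8"));
                byte[] next = Long.toString(current + delta).getBytes("UTF-8");
                try {
                    zk.setData(path, next, stat.getVersion());
                    return;
                } catch (KeeperException.BadVersionException e) {
                    // another job updated first; re-read and retry
                }
            }
        }
    }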

Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

2012-09-10 Thread Safdar Kureishy
Hi, I need to run some benchmarking tests for a given MapReduce job on a *subset* of a 10-node Hadoop cluster. Not that it matters, but the current cluster settings allow for ~20 map slots and 10 reduce slots per node. Without loss of generality, let's say I want a job with these constraints

Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

2012-09-10 Thread Hemanth Yamijala
Hi, I am not sure if there's any way to restrict the tasks to specific machines. However, I think there are some ways of restricting the number of 'slots' that can be used by the job. Also, I am not sure which version of Hadoop you are on. The CapacityScheduler
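
A hedged sketch of the slot-capping route Hemanth alludes to, for MRv1's CapacityScheduler: cap a dedicated queue in capacity-scheduler.xml and submit the benchmark job to it. The queue name and percentages are invented, and the property names should be verified against your Hadoop version:

    <!-- capacity-scheduler.xml: a "bench" queue limited to ~20% of slots.
         The queue must also be listed in mapred.queue.names. -->
    <property>
      <name>mapred.capacity-scheduler.queue.bench.capacity</name>
      <value>20</value>
    </property>
    <property>
      <name>mapred.capacity-scheduler.queue.bench.maximum-capacity</name>
      <value>20</value> <!-- hard cap; the queue never grows beyond this -->
    </property>

Submitting with -Dmapred.job.queue.name=bench then confines the job to that slice of the cluster's slots, though, as noted above, not to specific machines.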

Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

2012-09-10 Thread Safdar Kureishy
Thanks, Bertrand and Hemanth, for your prompt replies! This helps :) Regards, Safdar On Mon, Sep 10, 2012 at 2:18 PM, Bertrand Dechoux decho...@gmail.com wrote: If that is only for benchmarking, you could stop the task-trackers on the machines you don't want to use. Or you could set up another

Re: One petabyte of data loading into HDFS within 10 min.

2012-09-10 Thread prabhu K
Hi Users, Thanks for the response. We have loaded 100 GB of data into HDFS; it took 1 hr with the below configuration. Each node (1 master machine, 2 slave machines): 1. 500 GB hard disk 2. 4 GB RAM 3. 3 quad-core CPUs 4. Speed 1333 MHz Now, we are planning to load 1

Re: One petabyte of data loading into HDFS within 10 min.

2012-09-10 Thread Steve Loughran
On 10 September 2012 08:40, prabhu K prabhu.had...@gmail.com wrote: Hi Users, Thanks for the response. We have loaded 100 GB of data into HDFS; it took 1 hr with the below configuration. Each node (1 master machine, 2 slave machines): 1. 500 GB hard disk 2. 4 GB RAM 3.

RE: build failure - trying to build hadoop trunk checkout

2012-09-10 Thread Tony Burton
Thanks for the replies, all. I'll investigate changing my hostname and report back. (Seems a bit hacky though - can someone explain, using easy words, why this happens in Kerberos?) Tony From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] Sent: 06 September 2012 18:51 To:

Undeliverable messages

2012-09-10 Thread Tony Burton
Sorry for admin-only content: can we remove this address from the list? I get the bounce message below whenever I post to user@hadoop.apache.org. Thanks! Tony _ From: postmas...@sas.sungardrs.com

Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Hemanth Yamijala
Hi, You could check DistributedCache ( http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache). It would allow you to distribute data to the nodes where your tasks are run. Thanks Hemanth On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
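
A minimal sketch of the DistributedCache flow with the old mapred API used in this thread (the HDFS path is illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
        // At job-submission time: register an HDFS file for distribution.
        public static void setup(JobConf conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/data/lookup.txt"), conf);
        }

        // Inside Mapper.configure(JobConf): read the task-local copy.
        public static void readCached(JobConf conf) throws Exception {
            Path[] local = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader in = new BufferedReader(new FileReader(local[0].toString()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    // build an in-memory lookup structure from each line
                }
            } finally {
                in.close();
            }
        }
    }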

Re: Undeliverable messages

2012-09-10 Thread Harsh J
I'd filed this very request earlier, but I'm unsure how to get it done: https://issues.apache.org/jira/browse/HADOOP-8062. I'm not an admin on the mailing lists. On Mon, Sep 10, 2012 at 3:16 PM, Tony Burton tbur...@sportingindex.com wrote: Sorry for admin-only content: can we remove this address from the list? I

Re: NoSuchMethodException when using old mapred apis

2012-09-10 Thread Harsh J
Hi, Your failure seems to be on the task side. I suspect a mix of libraries. What version of Hadoop are you *deploying* across all nodes? On Mon, Sep 10, 2012 at 3:56 PM, Li Li fancye...@gmail.com wrote: hi all, I am trying an example from a tutorial for version 0.19 by using hadoop

how to skip a mapper

2012-09-10 Thread Anit Alexander
Hello list, Is it possible to start the mapper from a particular byte location in a file which is in HDFS? Regards, Anit

Re: NoSuchMethodException when using old mapred apis

2012-09-10 Thread Harsh J
P.S. Please see my reply on the Stack Overflow link you'd sent, if you are hitting the same problem. On Mon, Sep 10, 2012 at 4:53 PM, Harsh J ha...@cloudera.com wrote: Hi, Your failure seems to be on the task side. I suspect a mix of libraries. What version of Hadoop are you *deploying*

Re: how to skip a mapper

2012-09-10 Thread Harsh J
Anit, Yes, this is possible (and actually does happen in the regular MR scenario anyway - when the input is split across several locations). You'll need a custom InputFormat#getSplits implementation to do this (create input splits with the first offset itself set to the known offset location, instead
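
A hedged sketch of that idea with the old mapred API: a TextInputFormat subclass whose single split starts at a known byte offset (the configuration key is invented for illustration):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class OffsetTextInputFormat extends TextInputFormat {
        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            long start = job.getLong("example.start.offset", 0L); // hypothetical key
            Path file = FileInputFormat.getInputPaths(job)[0];
            FileStatus status = file.getFileSystem(job).getFileStatus(file);
            // One split from the requested offset to end-of-file; the line
            // record reader then skips ahead to the first full record.
            return new InputSplit[] {
                new FileSplit(file, start, status.getLen() - start, (String[]) null)
            };
        }
    }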

Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Harsh J
Sigurd, Hemanth's recommendation of DistributedCache does fit your requirement - it is a generic way of distributing files and archives to the tasks of a job. It is not something that pushes things automatically into memory, but onto the local disk of the TaskTracker your task runs on. You can choose to

Re: One petabyte of data loading into HDFS within 10 min.

2012-09-10 Thread Michael Segel
On Sep 10, 2012, at 2:40 AM, prabhu K prabhu.had...@gmail.com wrote: Hi Users, Thanks for the response. We have loaded 100 GB of data into HDFS; it took 1 hr with the below configuration. Each node (1 master machine, 2 slave machines): 1. 500 GB hard disk 2. 4 GB RAM

Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Sigurd Spieckermann
OK, interesting. Just to confirm: is it okay to distribute quite large files through the DistributedCache? Dataset B could be on the order of gigabytes. Also, if I have far fewer nodes than elements/blocks in A, then the probability that every node will have to read (almost) every block of B is

Re: best way to join?

2012-09-10 Thread Mirko Kämpf
Hi Dexter, I am not sure I understood your requirements correctly, so I will restate them to define a starting point. 1.) You have a (static) list of points (the points.txt file). 2.) Now you want to calculate the nearest points to a set of given points. Are the points which have to be considered in a

Re: Job Controller for MapReduce task assignment

2012-09-10 Thread John Cuffney
Hey, That's very helpful, thank you. I guess to be more clear about what I'm doing: I want to have a simulation that runs through the mapping portion of the MR, stops, sets a checkpoint, then runs the reduce portion of the MR. So I guess the issue is finding a point in between the Map and

RE: One petabyte of data loading into HDFS within 10 min.

2012-09-10 Thread Siddharth Tiwari
Well, can't you load only the incremental data? The goal seems quite unrealistic. The big guns have already spoken :P Cheers!!! Siddharth Tiwari Have a refreshing day!!! "Every duty is holy, and devotion to duty is the highest form of worship of God." Maybe

Record separator

2012-09-10 Thread Siddharth Tiwari
Hi list, Out of context: has anyone encountered a record-separator delimiter problem? I have a log file in which each record is separated using the RECORD SEPARATOR delimiter (^^); can anyone help me with how I can use it as a delimiter? Thanks Cheers!!!

Re: One petabyte of data loading into HDFS within 10 min.

2012-09-10 Thread Michael Segel
Well, I think the question would make more sense if he meant to ask how one could load a 1 GB file within 10 mins. Note that there are 1x10^6 GB in a PB (hence the comment about being off by several orders of magnitude). Now, were the OP asking about how to load a 1 GB file in 10 min, then you're
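
For concreteness, a back-of-the-envelope calculation (not from the thread) of what the stated goal implies:

    1 PB in 10 min:    10^6 GB = 8 x 10^6 Gbit; 8 x 10^6 Gbit / 600 s  ~ 13,000 Gbit/s aggregate ingest
    reported baseline: 100 GB in 1 hr = 800 Gbit / 3600 s             ~ 0.22 Gbit/s

    gap: roughly 60,000x, i.e. several orders of magnitude.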

Which metrics to track?

2012-09-10 Thread Jones, Robert
Hello all, I am a sysadmin and do not know that much about Hadoop. I run a stats/metrics tracking system that logs stats over time so you can look at historical and current data and perform some trend analysis. I know I can access several Hadoop metrics via JMX by going to
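
A hedged sketch (not from the thread) of polling one such metric with the standard javax.management client API; the RMI port, MBean domain ("hadoop" vs. "Hadoop"), and attribute names vary by Hadoop version and are illustrative only:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxPoll {
        public static void main(String[] args) throws Exception {
            // Hypothetical host/port; the daemon must be started with JMX enabled.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi");
            JMXConnector c = JMXConnectorFactory.connect(url, null);
            MBeanServerConnection mbs = c.getMBeanServerConnection();
            ObjectName fs = new ObjectName("Hadoop:service=NameNode,name=FSNamesystemState");
            System.out.println("CapacityRemaining = "
                + mbs.getAttribute(fs, "CapacityRemaining"));
            c.close();
        }
    }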

RE: Record separator

2012-09-10 Thread Siddharth Tiwari
Hi Harsh, I am using CDH3U4. The records are separated by the following ASCII character: ^^ (decimal 30, hex 1E, RS, "Record Separator"). I did not understand what you intend me to do so that I can use this one? Thanks Cheers!!! Siddharth Tiwari Have a refreshing day!!! Every duty is holy,
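
A hedged sketch: on builds whose line reader honours the textinputformat.record.delimiter property (not guaranteed on every 0.20-based release such as CDH3u4, so verify first; otherwise a custom RecordReader is needed), the 0x1E separator can be set directly:

    import org.apache.hadoop.conf.Configuration;

    public class RsDelimiter {
        public static void configure(Configuration conf) {
            // ASCII 30 / 0x1E, the RS "Record Separator" control character
            conf.set("textinputformat.record.delimiter", "\u001E");
        }
    }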

Re: Can't run PI example on hadoop 0.23.1

2012-09-10 Thread Vinod Kumar Vavilapalli
The AM corresponding to your MR job is failing continuously. Can you check the container logs for your AM? They should be in ${yarn.nodemanager.log-dirs}/${application-id}/container_[0-9]*_0001_01_01/stderr Thanks, +Vinod On Sep 10, 2012, at 3:19 PM, Smarty Juice wrote: Hello

Re: Understanding of the hadoop distribution system (tuning)

2012-09-10 Thread Jagat Singh
Hello Elaine, You did not mention your cluster size: the number of nodes and the cores in each node, or what sort of work you are doing. 6 hours for 518 MB of data is a huge amount of time. The number of map tasks would be 518/64 ≈ 9, so that many map tasks need to run to process your data. Now they can run on a single node or

Re: Which metrics to track?

2012-09-10 Thread Adam Faris
From an operations perspective, Hadoop metrics are a bit different from watching hosts behind a load balancer, as one needs to start thinking in terms of distributed systems and not individual hosts. The reason is that the Hadoop platform is fairly resilient against multiple node failures,

Re: Which metrics to track?

2012-09-10 Thread Gulfie
On Mon, Sep 10, 2012 at 08:16:29PM +0000, Jones, Robert wrote: Hello all, I am a sysadmin and do not know that much about Hadoop. I run a stats/metrics tracking system that logs stats over time so you can look at historical and current data and perform some trend analysis. I know I can

how to make different mappers execute different processing on the same data?

2012-09-10 Thread Jason Yang
Hi all, I've got a question about how to make different mappers execute different processing on the same data. Here is my scenario: I have some data to process; however, there are multiple ways to process it and I have no idea which one is better, so I was thinking that maybe I could execute

Re: How to make different mappers execute different processing on the same data?

2012-09-10 Thread Harsh J
Hey Jason, While I am not sure what's the best way to automatically evaluate this during the execution of a job, the MultipleInputs class offers a way to run different map implementations within a single job for different input paths. You could perhaps leverage that with duplicated (or symlinked?)
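
A minimal sketch of that wiring with the old mapred API: two paths (which could be duplicated or symlinked copies of the same dataset) bound to two different mapper classes in one job. All class and path names are invented:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class TwoStrategies {
        public static class StrategyAMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable k, Text v,
                            OutputCollector<Text, Text> out, Reporter r)
                    throws IOException {
                out.collect(new Text("A"), v); // strategy A's processing here
            }
        }

        public static class StrategyBMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable k, Text v,
                            OutputCollector<Text, Text> out, Reporter r)
                    throws IOException {
                out.collect(new Text("B"), v); // strategy B's processing here
            }
        }

        public static void wire(JobConf conf) {
            MultipleInputs.addInputPath(conf, new Path("/data/copyA"),
                                        TextInputFormat.class, StrategyAMapper.class);
            MultipleInputs.addInputPath(conf, new Path("/data/copyB"),
                                        TextInputFormat.class, StrategyBMapper.class);
        }
    }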

Re: Understanding of the hadoop distribution system (tuning)

2012-09-10 Thread Hemanth Yamijala
Hi, Responses inline to some points. On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan elaine-...@gmo.jp wrote: Hi, I'm new to Hadoop and I've just played around with MapReduce. I would like to check if my understanding of Hadoop is correct, and I would appreciate it if anyone could correct me if