Hi Manoj
From my limited knowledge of file appends in HDFS, I have seen more
recommendations to use sync() in the latest releases than to use append().
Let us wait for a committer to comment authoritatively on the production
readiness of append(). :)
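For what it's worth, here is a minimal sketch of the two routes (the path and record contents are hypothetical; sync() is the 1.x-era call, hflush()/hsync() replace it on later releases, and append() must be enabled via dfs.support.append):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncVsAppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.log"); // hypothetical path

        // Route 1: keep the stream open and flush durably as you go.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("first record\n");
        out.sync(); // hflush()/hsync() on newer releases
        out.writeBytes("second record\n");
        out.close();

        // Route 2: reopen the closed file and append to it. Requires
        // dfs.support.append and a release where append is considered safe.
        FSDataOutputStream appendOut = fs.append(file);
        appendOut.writeBytes("third record\n");
        appendOut.close();
    }
}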
Regards
Bejoy KS
On Mon, Sep 10, 2012
Thank you Bejoy.
Cheers!
Manoj.
On Mon, Sep 10, 2012 at 1:36 PM, Bejoy Ks bejoy.had...@gmail.com wrote:
Hi Subbu,
You're probably looking for something called Distributed counters. Take a
look at this question at StackOverflow:
http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation
Best regards,
Robin Verlangen
*Software engineer*
W http://www.robinverlangen.nl
E
Counters are per-job in Hadoop MapReduce. You need an external aggregator for
such cross-job counters - e.g. a node in ZooKeeper.
Also, is it just for display, or does your job logic depend on this? If it is the
former, and if you don't have a problem with waiting till the jobs finish, you can
do
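A hedged sketch of that wait-for-completion route (the counter group and name, and the list of jobs, are hypothetical; each job is assumed to increment the same named counter):

import java.util.List;
import org.apache.hadoop.mapreduce.Job;

public class CrossJobCounterSum {
    // "MyCounters"/"RECORDS_SEEN" are hypothetical group/name that the
    // mappers or reducers of each job increment.
    public static long sumAcrossJobs(List<Job> jobs) throws Exception {
        long total = 0;
        for (Job job : jobs) {
            job.waitForCompletion(true); // block until the job finishes
            total += job.getCounters()
                        .findCounter("MyCounters", "RECORDS_SEEN")
                        .getValue();
        }
        return total;
    }
}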
I am out of the office until 09/11/2012.
I am out of office.
For HAMSTER related things, you can contact Jason(Deng Peng Zhou/China/IBM)
For CFM related things, you can contact Daniel(Liang SH Su/China/Contr/IBM)
For TMB related things, you can contact Flora(Jun Ying Li/China/IBM)
For TWB
Hi,
I need to run some benchmarking tests for a given MapReduce job on a *subset*
of a 10-node Hadoop cluster. Not that it matters, but the current cluster
settings allow for ~20 map slots and 10 reduce slots per node.
Without loss of generality, let's say I want a job with these
constraints
Hi,
I am not sure if there's any way to restrict the tasks to specific
machines. However, I think there are some ways of restricting the
number of 'slots' that can be used by the job.
Also, not sure which version of Hadoop you are on. The
CapacityScheduler
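If you go the CapacityScheduler route, a rough sketch of submitting to a capacity-limited queue could look like this (the queue name "benchmark" is hypothetical and must already be defined with a small capacity in the scheduler configuration, which indirectly caps the slots the job can use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "benchmark" is a hypothetical queue defined in capacity-scheduler.xml.
        conf.set("mapred.job.queue.name", "benchmark");
        Job job = new Job(conf, "benchmark-run");
        // ... set mapper/reducer, input and output paths as usual, then submit ...
    }
}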
Thanks Bertrand/Hemanth, for your prompt replies! This helps :)
Regards,
Safdar
On Mon, Sep 10, 2012 at 2:18 PM, Bertrand Dechoux decho...@gmail.com wrote:
If that is only for benchmarking, you could stop the task-trackers on the
machines you don't want to use.
Or you could set up another
Hi Users,
Thanks for the response.
We have loaded 100GB of data into HDFS; it took 1 hour with the below
configuration.
Each node (1 master machine, 2 slave machines):
1. 500 GB hard disk
2. 4 GB RAM
3. 3 quad core CPUs
4. Speed 1333 MHz
Now, we are planning to load 1
On 10 September 2012 08:40, prabhu K prabhu.had...@gmail.com wrote:
Thanks for the replies all. I'll investigate changing my hostname and report
back.
(Seems a bit hacky though - can someone explain using easy words why this
happens in Kerberos?)
Tony
From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
Sent: 06 September 2012 18:51
To:
Sorry for admin-only content: can we remove this address from the list? I get
the bounce message below whenever I post to user@hadoop.apache.org.
Thanks!
Tony
_
From: postmas...@sas.sungardrs.com
Hi,
You could check DistributedCache (
http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
It would allow you to distribute data to the nodes where your tasks are run.
Thanks
Hemanth
On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
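To make the DistributedCache suggestion concrete, a rough driver-side sketch (the HDFS path and the "#datasetB" symlink name are hypothetical; the file must already be in HDFS):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "job-with-side-data");
        // Ship a file already in HDFS to every node that runs a task of this
        // job; "#datasetB" gives it a friendlier name in the task's working dir.
        DistributedCache.addCacheFile(
            new URI("/user/sigurd/datasetB.txt#datasetB"), job.getConfiguration());
        // ... configure mapper/reducer, input and output paths, then submit ...
    }
}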
I'd already filed this, but I am unsure how to get it done:
https://issues.apache.org/jira/browse/HADOOP-8062. I'm not an admin on
the mailing lists.
On Mon, Sep 10, 2012 at 3:16 PM, Tony Burton tbur...@sportingindex.com wrote:
Sorry for admin-only content: can we remove this address from the list? I
Hi,
Your failure seems to be on the task side; I suspect a mix of libraries.
What version of Hadoop are you *deploying* across all nodes?
On Mon, Sep 10, 2012 at 3:56 PM, Li Li fancye...@gmail.com wrote:
hi all,
I am trying an example from a tutorial for version 0.19 by using
hadoop
Hello list,
Is it possible to start the mapper from a particular byte
location in a file which is in hdfs?
Regards,
Anit
P.S. Please see my reply on the StackOverflow link you'd sent, if you
are hitting the same problem.
On Mon, Sep 10, 2012 at 4:53 PM, Harsh J ha...@cloudera.com wrote:
Anit,
Yes, this is possible (and actually does happen in a regular MR scenario
anyway - when the input is split across several locations). You'll
need a custom InputFormat#getSplits implementation to do this (create
input splits with the first offset itself set to the known offset
location, instead
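A rough sketch of that idea built on TextInputFormat (the configuration key is an assumption; splits that lie entirely before the offset are dropped and the split that straddles it is trimmed):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class OffsetTextInputFormat extends TextInputFormat {
    // Hypothetical property carrying the byte offset to start from.
    public static final String START_OFFSET = "example.start.offset";

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        long startOffset = job.getConfiguration().getLong(START_OFFSET, 0L);
        List<InputSplit> trimmed = new ArrayList<InputSplit>();
        for (InputSplit split : super.getSplits(job)) {
            FileSplit fs = (FileSplit) split;
            long end = fs.getStart() + fs.getLength();
            if (end <= startOffset) {
                continue; // entirely before the offset, skip it
            }
            long newStart = Math.max(fs.getStart(), startOffset);
            trimmed.add(new FileSplit(fs.getPath(), newStart,
                    end - newStart, fs.getLocations()));
        }
        return trimmed;
    }
}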
Sigurd,
Hemanth's recommendation of DistributedCache does fit your requirement
- it is a generic way of distributing files and archives to tasks of a
job. It is not something that pushes things automatically into memory,
but onto the local disk of the TaskTracker your task runs on. You can
choose to
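On the task side, reading the localized copy off that local disk could look roughly like the following (class name is hypothetical; the file is assumed to have been added via DistributedCache in the driver):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException {
        // Files shipped via DistributedCache are localized onto the
        // TaskTracker's disk; getLocalCacheFiles returns their local paths.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached != null && cached.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            // ... read the side data into memory or index it here ...
            reader.close();
        }
    }
}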
On Sep 10, 2012, at 2:40 AM, prabhu K prabhu.had...@gmail.com wrote:
OK, interesting. Just to confirm: is it okay to distribute quite large
files through the DistributedCache? Dataset B could be on the order of
gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
the probability that every node will have to read (almost) every block of B
is
Hi Dexter,
I am not sure if I understood your requirements right,
so I will repeat them to define a starting point.
1.) You have a (static) list of points (the points.txt file)
2.) Now you want to calculate the nearest points to a set of given points.
Are the points which have to be considered in a
Hey,
That's very helpful, thank you. I guess to be more clear about what I'm
doing, I want to have a simulation that runs through the mapping portion of
the MR, stops, sets a checkpoint, then runs the reduce portion of the MR.
So I guess the issue is finding a point in between the Map and
Well, can't you load only the incremental data, as the goal seems quite
unrealistic? The big guns have already spoken :P
Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of
God.”
Maybe
Hi list,
Out of context, has anyone encountered a record-separator delimiter problem? I
have a log file in which each record is separated using the RECORD SEPARATOR
delimiter (^^). Can anyone help me with how I can use it as a delimiter?
Thanks
Cheers !!!
Well, I think the question would make more sense if he meant to say how one
could load a GB file within 10 mins.
Note that there are 1x10^6 GB in a PB. (Hence the comment about being off by several
orders of magnitude.)
Now, were the OP asking how to load a 1GB file in 10 min,
then you're
Hello all, I am a sysadmin and do not know that much about Hadoop. I run a
stats/metrics tracking system that logs stats over time so you can look at
historical and current data and perform some trend analysis. I know I can
access several Hadoop metrics via JMX by going to
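If it helps, a rough sketch of pulling those MBeans programmatically over JMX (host, port, and the remote-JMX setup are assumptions; the daemon must have been started with the com.sun.management.jmxremote options):

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HadoopJmxProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/port; they point at a daemon started with
        // -Dcom.sun.management.jmxremote.port=<port> (plus auth/ssl settings).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://namenode.example.com:8004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        // List every registered MBean; Hadoop's show up under a "hadoop"/"Hadoop"
        // domain depending on the metrics framework in use.
        Set<ObjectName> names = mbs.queryNames(null, null);
        for (ObjectName name : names) {
            System.out.println(name);
        }
        connector.close();
    }
}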
Hi Harsh,
I am using CDH3u4.
The records are separated by the following ASCII character: ^^ (decimal 30, hex 1E, RS, ␞, Record Separator).
I did not understand what you intend me to do so that I can use this one?
Thanks
Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
Every duty is holy,
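For the record-separator question above: on Hadoop releases whose TextInputFormat honours a configurable delimiter, a rough sketch like the one below may work; on releases without that property, a custom RecordReader would be needed instead. The job name is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RsDelimitedJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // \u001E is ASCII 30 (0x1E), the Record Separator character. The
        // property is only honoured where TextInputFormat reads a custom
        // delimiter from configuration.
        conf.set("textinputformat.record.delimiter", "\u001E");
        Job job = new Job(conf, "rs-delimited-input");
        // ... set mapper, input and output paths as usual, then submit ...
    }
}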
The AM corresponding to your MR job is failing continuously. Can you check the
container logs for your AM? They should be in
${yarn.nodemanager.log-dirs}/${application-id}/container_[0-9]*_0001_01_01/stderr
Thanks,
+Vinod
On Sep 10, 2012, at 3:19 PM, Smarty Juice wrote:
Hello
Hello Elaine,
You did not tell us your cluster size: the number of nodes and the cores in each node.
What sort of work are you doing? 6 hours for 518MB of data is a huge amount of time.
The number of map tasks would be 518/64, i.e. about 9 (with a 64MB block size).
So that many map tasks need to run to process your data.
Now they can run on a single node or
From an operations perspective, Hadoop metrics are a bit different than
watching hosts behind a load balancer, as one needs to start thinking in terms
of distributed systems and not individual hosts. The reason is that the
Hadoop platform is fairly resilient against multiple node failures,
On Mon, Sep 10, 2012 at 08:16:29PM +, Jones, Robert wrote:
Hi, all
I've got a question about how to make different mappers execute different
processing on the same data.
Here is my scenario:
I have to process some data; however, there are multiple choices for processing this
data and I have no idea which one is better, so I was thinking that maybe I
could execute
Hey Jason,
While I am not sure what's the best way to automatically evaluate
during the execution of a job, the MultipleInputs class offers a way
to run different map implementations within a single job for different
input paths. You could perhaps leverage that with duplicated (or
symlinked?)
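A rough sketch of that MultipleInputs route (paths and the two mapper classes are hypothetical; pointing two paths at the same data, duplicated or symlinked as suggested, runs both implementations in one job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompareMappersSketch {
    // Hypothetical mapper variants; real ones would implement the two
    // processing choices being compared.
    public static class VariantAMapper extends Mapper<LongWritable, Text, Text, Text> {}
    public static class VariantBMapper extends Mapper<LongWritable, Text, Text, Text> {}

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "compare-processing-variants");
        job.setJarByClass(CompareMappersSketch.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Two copies (or symlinks) of the same input, each handled by a
        // different mapper implementation.
        MultipleInputs.addInputPath(job, new Path("/data/copyA"),
                TextInputFormat.class, VariantAMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/copyB"),
                TextInputFormat.class, VariantBMapper.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        job.waitForCompletion(true);
    }
}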
Hi,
Responses inline to some points.
On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan elaine-...@gmo.jp wrote:
Hi,
I'm new to Hadoop and I've just played around with MapReduce.
I would like to check if my understanding of Hadoop is correct and I
would appreciate it if anyone could correct me if