Regarding the timeout, I think the limiting value can be set in seconds using
the config parameters.
Thanks,
Amogh
-Original Message-
From: Aaron Kimball [mailto:aa...@cloudera.com]
Sent: Friday, July 10, 2009 3:02 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop: Reduce exceedi
Yes it is. However, I assume file 2 is "comparatively" small to be distributed
across all computing nodes without much delay, else the whole point of map side
join is defeated.
If keys in file 2 are unique, it is a simple lookup you need to implement. Else,
iterate over them to implement the join.
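For the lookup case, something along these lines should work (an untested sketch with the old API; the tab-separated record format and the assumption that file 2 is shipped via DistributedCache are placeholders for your setup):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MapSideJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  public void configure(JobConf job) {
    try {
      // file 2 was added with DistributedCache.addCacheFile() in the driver
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);   // key <tab> value
        lookup.put(parts[0], parts[1]);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load side file", e);
    }
  }

  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] parts = record.toString().split("\t", 2);
    String other = lookup.get(parts[0]);          // simple lookup on the join key
    if (other != null) {
      out.collect(new Text(parts[0]), new Text(parts[1] + "\t" + other));
    }
  }
}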
I'm confused. What do you mean by "query be distributed over all datanodes or just
1 node"? If your data is small enough that it fits in just one block (and is
replicated by hadoop), then just one task will be run (assuming the default input
split).
If the data is spread across multiple blocks, you
If you need to set the Java options for memory, you can do this via the job
configuration in your MR job.
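For example, a minimal sketch of setting the per-task JVM options on the job configuration before submission (MyJob is just a placeholder driver class):

JobConf conf = new JobConf(MyJob.class);
// per-task child JVM options, e.g. a bigger heap for each map/reduce task
conf.set("mapred.child.java.opts", "-Xmx512m");
JobClient.runJob(conf);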
-Original Message-
From: Fernando Padilla [mailto:f...@alum.mit.edu]
Sent: Wednesday, July 22, 2009 9:11 AM
To: common-user@hadoop.apache.org
Subject: best way to set memory
So.. I want to have different memory settings for
each daemon-type..
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start secondarynamenode
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker
Amogh Vasekar wrote:
> If you need to set the java_options for mem., you can do
Does MultipleOutputFormat suffice?
Cheers!
Amogh
-Original Message-
From: Mark Kerzner [mailto:markkerz...@gmail.com]
Sent: Thursday, July 23, 2009 6:24 AM
To: core-u...@hadoop.apache.org
Subject: Output of a Reducer as a zip file?
Hi,
my output consists of a number of binary files, cor
>> the reducer is called a
>>second time to do nothing, before all is done
Can you elaborate please?
Amogh
-Original Message-
From: Mark Kerzner [mailto:markkerz...@gmail.com]
Sent: Monday, July 27, 2009 8:51 PM
To: core-u...@hadoop.apache.org
Subject: Why is single reducer called twice
This is particularly useful if your input is the output of another MR job;
otherwise it is a killer.
You may want to write your own mapper in case one of the files to be joined is
small enough to fit in memory / can be handled in splits.
Thanks,
Amogh
-Original Message-
From: Jason Venner [mail
What is the use case for this? Especially since you have 0 reducers.
Thanks,
Amogh
-Original Message-
From: Saptarshi Guha [mailto:saptarshi.g...@gmail.com]
Sent: Friday, July 31, 2009 12:08 PM
To: core-u...@hadoop.apache.org
Subject: Re: Running 145K maps, zero reduces- does Hadoop scal
Ideally this should be done using the GenericOptionsParser. Please have a look at
ToolRunner for more info.
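A bare-bones Tool driver would look roughly like this (untested sketch; MyDriver is a placeholder name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // GenericOptionsParser has already applied -D key=value, -files, -libjars etc.
    Configuration conf = getConf();
    // ... build and submit the job here using conf; args holds your own arguments ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}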
Thanks,
Amogh
-Original Message-
From: Mark Kerzner [mailto:markkerz...@gmail.com]
Sent: Saturday, August 01, 2009 2:47 AM
To: common-user@hadoop.apache.org
Subject: setting parameters f
Maybe I'm missing the point, but in terms of execution performance benefit,
what does copying to dfs and then compressing to be fed to a map/reduce job
provide? Isn't it better to compress "offline" / outside latency window and
make available on dfs?
Also, your mapreduce program will launch one
Have you had a look at the reporter counters hadoop provides? I think they might
be helpful in your case, wherein you can locally aggregate for each map task
and then push the total to a global counter.
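Roughly like this (untested sketch, old API; the counter group/name and the condition are just placeholders):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private long localCount = 0;
  private Reporter lastReporter;   // remembered so close() can push the total

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    lastReporter = reporter;
    if (value.toString().contains("ERROR")) {   // placeholder condition
      localCount++;                             // aggregate locally, no counter call per record
    }
    // ... normal map output ...
  }

  @Override
  public void close() throws IOException {
    if (lastReporter != null) {
      // one counter update per task instead of one per record
      lastReporter.incrCounter("myapp", "error-lines", localCount);
    }
  }
}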
-Original Message-
From: Zhong Wang [mailto:wangzhong@gmail.com]
Sent: Monday, August 03, 2
10 mins reminds me of the parameter mapred.task.timeout. This is configurable. Or
alternatively you might just do a sysout to let the tracker know of the task's
existence (not an ideal solution though).
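For reference, a sketch of bumping it in the driver (the value is in milliseconds; MyJob is a placeholder):

JobConf conf = new JobConf(MyJob.class);
conf.setLong("mapred.task.timeout", 30 * 60 * 1000);  // e.g. 30 minutes
// or, from inside a long-running map()/reduce(), call reporter.progress()
// periodically so the tracker knows the task is still alive.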
Thanks,
Amogh
-Original Message-
From: Mathias De Maré [mailto:mathias.dem...@gmail.com]
Sent: We
While setting mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, please consider the memory usage your
application might have, since all tasks will be competing for the same memory and
might reduce overall performance.
Thanks,
Amogh
-Original Message-
From: Harish
AFAIK,
hadoop.tmp.dir : Used by NN and DN for directory listings and metadata ( don't
have much info on this )
java.opts & ulimit : ulimit defines the maximum limit of virtual memory for the
launched task; java.opts is the amount of memory reserved for a task.
When setting these you need to account for memo
Hi,
GenericOptionsParser is customized only for Hadoop specific params :
* GenericOptionsParser recognizes several standard command
* line arguments, enabling applications to easily specify a namenode, a
* jobtracker, additional configuration resources etc.
Ideally, all params must be passe
I'm not sure that is the case with Hadoop. I think it assigns a reduce task to
an available tasktracker at any instant, since a reducer polls the JT for completed
maps. And if it were the case as you said, a reducer wouldn't be initialized until
all maps have completed, after which the copy phase would st
PM
To: common-user@hadoop.apache.org
Subject: Re: MR job scheduler
Amogh
I think the Reduce phase starts only when all the map phases are completed,
because it needs all the values corresponding to a particular key!
2009/8/21 Amogh Vasekar
> I'm not sure that is the case with Hadoop. I t
ansferring data across the network(because already
many values to that key are on that machine where the map phase completed)..
2009/8/21 Amogh Vasekar
> Yes, but the copy phase starts with the initialization for a reducer, after
> which it would keep polling for completed map tasks to fetc
Maybe look at the mapred.reduce.parallel.copies property to speed it up... I don't
see why transfer speed would be configured via params, and I think hadoop
won't be messing with that.
Thanks,
Amogh
-Original Message-
From: yang song [mailto:hadoop.ini...@gmail.com]
Sent: Monday, August 24
Hadoop will make sure that every pair with the same key will end up in the same
reducer and be consumed in a single reduce invocation.
-Original Message-
From: Nipun Saggar [mailto:nipun.sag...@gmail.com]
Sent: Tuesday, August 25, 2009 10:41 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop
Hi,
Mapper is used to process the pair passed to it; MapRunnable is an
interface which, when implemented, is responsible for generating a conforming
pair and passing it to the Mapper.
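A skeleton MapRunnable would look roughly like this (untested sketch; it is essentially what the default MapRunner does):

import java.io.IOException;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.ReflectionUtils;

public class MyMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    mapper = ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
      // this loop is where you control which pairs actually reach the Mapper
      mapper.map(key, value, output, reporter);
    }
    mapper.close();
  }
}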
Cheers!
Amogh
-Original Message-
From: Rakhi Khatwani [mailto:rkhatw...@gmail.com]
Sent: Thursday, August 27, 2009
This won't change the daemon configs.
Hadoop by default allocates 1000MB of memory for each of its daemons, which can
be controlled by HADOOP_HEAPSIZE, HADOOP_NAMENODE_OPTS, HADOOP_TASKTRACKER_OPTS
in the hadoop script.
However, there was a discussion on this sometime back wherein these options
w
Cheers!
Amogh
-Original Message-
From: Stas Oskin [mailto:stas.os...@gmail.com]
Sent: Tuesday, September 01, 2009 2:31 PM
To: common-user@hadoop.apache.org
Subject: Re: Datanode high me
AFAIK, releaseCache only works on cleaning the reference to your file. Try using
deletecache in a synchronized manner.
Thanks,
Amogh
-Original Message-
From: #YONG YONG CHENG# [mailto:aarnc...@pmail.ntu.edu.sg]
Sent: Thursday, September 03, 2009 8:50 AM
To: common-user@hadoop.apache.org
Subje
Before setting the task limits, do take into account the memory considerations
( many archive posts on this can be found ).
Also, your tasktracker and datanode daemons will run on that machine as well,
so you might want to set aside some processing power for that.
Cheers!
Amogh
-Original M
Have a look at jobclient, it should suffice.
Cheers!
Amogh
-Original Message-
From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com]
Sent: Friday, September 04, 2009 9:15 PM
To: common-user@hadoop.apache.org
Subject: Re: Some issues!
Hey ,
I have one more doubt , Suppose
t: RE: DistributedCache purgeCache()
Thanks for your swift response.
But where can I find deletecache()?
Thanks.
-Original Message-
From: Amogh Vasekar [mailto:am...@yahoo-inc.com]
Sent: Thu 9/3/2009 2:44 PM
To: common-user@hadoop.apache.org
Subject: RE: DistributedCache purgeCache()
An alternative would be to use the hadoop fs APIs to recursively list file statuses
and pass those as the input files. This is slightly complicated but will give
you more control and might help while debugging as well.
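Something like this in the driver, before submitting the job (untested sketch):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputWalker {
  // add every plain file under dir (recursively) as an input path
  public static void addInputsRecursively(FileSystem fs, Path dir, JobConf job)
      throws IOException {
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) {
        addInputsRecursively(fs, stat.getPath(), job);
      } else {
        FileInputFormat.addInputPath(job, stat.getPath());
      }
    }
  }
}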
Just a thought.
Thanks,
Amogh
-Original Message-
From: Amandeep Khurana [ma
Hi,
Ran into a similar issue : https://issues.apache.org/jira/browse/HBASE-1791
Not sure if what you are experiencing is similar.
Context.progress() "should" work. One ugly hack would be to set the timeout
value to a high number, but I would wait for a better answer before doing that.
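If the job uses the new API, a sketch of keeping the task alive from inside a long-running map() (expensiveIterations and doExpensiveWork are placeholders for your own code):

@Override
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  for (int i = 0; i < expensiveIterations; i++) {
    doExpensiveWork(i);            // the part that takes longer than the timeout
    if (i % 1000 == 0) {
      context.progress();          // heartbeat so the tracker doesn't kill the task
    }
  }
}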
Thanks,
Amogh
Hi All,
Regarding the JVM reuse feature incorporated, it says reuse is generally
recommended for streaming and pipes jobs. I'm a little unclear on this and any
pointers will be appreciated.
Also, in what scenarios will this feature be helpful for java mapred jobs?
Thanks,
Amogh
Hi,
Funnily enough, I was looking at it just yesterday.
http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Task+JVM+Reuse
Thanks,
Amogh
-Original Message-
From: Zhimin [mailto:wan...@cs.umb.edu]
Sent: Tuesday, September 15, 2009 10:53 PM
To: core-u...@hadoop.apache.org
Subject
Hi,
Please check the namenode heap usage. Your cluster may have too many files
to handle / too little free space. It is generally available in the UI. This is
one of the causes I have seen for the timeout.
Amogh
-Original Message-
From: Kunsheng Chen [mailto:ke...@yahoo.com]
Sent:
Along with the partitioner, try to plug in a combiner. It would provide significant
performance gains. Not sure about the algo you use, but you might have to tweak
it a little to facilitate a combiner.
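Wiring it up is just one extra line in the driver (old-API sketch; the class names are placeholders, and the combiner must be associative/commutative, which is why the algo may need tweaking):

JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(MyMapper.class);
conf.setCombinerClass(MyReducer.class);   // often the reducer itself works as the combiner
conf.setReducerClass(MyReducer.class);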
Thanks,
Amogh
-Original Message-
From: Chandraprakash Bhagtani [mailto:cpbhagt...@gmail.com
I believe the framework checks timestamps on HDFS for marking an already available
copy of the file valid or invalid, since the archived files are not cleaned up
till a certain du limit is reached, and no APIs for cleanup are available. There
was a thread on this some time back on the list.
Amogh
Hi,
I guess configure() is now setup(), and using ToolRunner you can create a
configuration / context to mimic the required behavior.
Thanks,
Amogh
-Original Message-
From: Amandeep Khurana [mailto:ama...@gmail.com]
Sent: Tuesday, October 06, 2009 5:43 AM
To: common-user@hadoop.apache.org
Su
Hi Huang,
Haven't worked with Hbase but in general,
If you want to have control over which data split goes as a whole to a mapper, the
easiest way is to compress that split into a single file, making as many split
files as needed. If you need to know what file is currently being processed,
you can use ma
>> You can always pass them as comma delimited strings
Which would be pretty expensive per <key, value>, right? Would avro be looking into
solving such problems?
Amogh
-Original Message-
From: Jason Venner [mailto:jason.had...@gmail.com]
Sent: Tuesday, October 06, 2009 11:33 AM
To: common-user@hadoo
For starters look at any monitoring tool like vaidya, hadoop UI ( ganglia too,
haven't read much on it though ). Not sure if you need this for debugging
purposes or for some other real-time app.. You should be able to get info on
localhost of each of your map tasks in a pretty straightforward wa
Hi,
AFAIK readLine is not recommended on DataInput types. Also, look into
WritableUtils to see if something there may be used.
Hope this helps.
Amogh
On 10/15/09 9:31 AM, "z3r0c001" wrote:
I'm trying to implement Writable interface. but not sure how to
serialize/write/read data from nested ob
g Van Nguyen Dinh" wrote:
Thanks Amogh. For my application, I want each map task reports to me
where it's running. However, I have no idea how to use Java
Inetaddress APIs to get that info. Could you explain more?
Van
On Wed, Oct 14, 2009 at 2:16 PM, Amogh Vasekar wrote:
> For st
Hi,
Check the distributed cache APIs, it provides various functionalities to
distribute and add jars to classpath on compute machines.
Amogh
On 10/19/09 3:38 AM, "yz5od2" wrote:
Hi,
What is the preferred method to distribute the classes (in various
Jars) to my Hadoop instances, that are requi
Hi,
It would be more helpful if you provide the exact error here.
Also, hadoop uses the local FS to store intermediate data, along with HDFS for
final output.
If your job is memory intensive, try limiting the number of tasks you are
running in parallel on a machine.
Amogh
On 10/19/09 8:27 AM,
For skipping failed tasks try : mapred.max.map.failures.percent
Amogh
On 10/21/09 8:58 AM, "梁景明" wrote:
hi, I use hadoop 0.20 and 8 nodes. There is a job that has 130 maps to run,
and it completed 128 maps,
but 2 maps fail, and their failure in my case is acceptable, but the job fails,
the last 128 map a
Hi,
On what parameters does the output key of your (first) reducer depend?
Amogh
On 10/23/09 8:24 AM, "Aaron Kimball" wrote:
If you need another shuffle after your first reduce pass, then you need a
second MapReduce job to run after the first one. Just use an IdentityMapper.
This is a reasonab
Hi,
Many options are available here. You can use the jobconf (0.18) / context conf (0.20)
to pass these lines across all tasks (assuming the size isn't relatively large)
and use configure / setup to retrieve them. Or use the distributed cache to
read a file containing these lines ( possibly with jvm reu
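For the jobconf route, a minimal sketch (new API; the parameter name and theFewLines are placeholders):

// in the driver, before submitting:
conf.set("myapp.lines", theFewLines);   // theFewLines is a small String

// in the mapper/reducer:
@Override
protected void setup(Context context) {
  String lines = context.getConfiguration().get("myapp.lines");
  // parse and keep them for use in map()/reduce()
}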
Hi,
The reduce task looks at map tasks for the partition it requires, and pulls it
(the number of parallel copies is controlled by reduce.parallel.copies). As
partitions are taken in by the reduce task, it performs a merge sort; this forms
your sort & shuffle phase. Typically your mappers / reducers are O(n),
Hi Bhushan,
If splitting input files is an option, why don't you let hadoop do this for
you? If need be you may use a custom input format and sequencefile*outputformat.
Amogh
On 10/27/09 7:55 PM, "bhushan_mahale" wrote:
Hi Jason,
Thanks for the reply.
The string is the entire content of the
Hi,
Rebalancer should help you : http://issues.apache.org/jira/browse/HADOOP-1652
Amogh
On 10/28/09 2:54 PM, "Vibhooti Verma" wrote:
Hi All,
We are facing an issue with the distribution of data in a cluster where nodes
have different storage capacity.
We have 4 nodes with 100G capacity and 1 node w
Hi,
Quick questions...
Are you creating too many small files?
Are there any task side files being created?
Does the NN heap have enough space to hold the metadata? Any details on its
general health will probably be helpful to people on the list.
Amogh
On 11/2/09 2:02 PM, "Zhang Bingjun (Eddy)
Mark,
Set-up for a mapred job consumes a considerable amount of time and resources
and so, if possible a single job is preferred.
You can add multiple paths to your job, and if you need different processing
logic depending upon the input being consumed, you can use parameter
map.input.file in yo
Hi Mark,
A future release of Hadoop will have a MultipleInputs class, akin to
MultipleOutputs. This would allow you to have a different inputformat, mapper
depending on the path you are getting the split from. It uses special
Delegating[mapper/input] classes to resolve this. I understand backpor
Replies inline.
On 11/14/09 9:55 PM, "Hrishikesh Agashe"
wrote:
Hi,
Default DFS block size is 64 MB. Does this mean that if I put a file less than 64
MB on HDFS, it will not be divided any further?
--Yes, the file will be stored in a single block per replica.
I have lots and lots of XMLs and I wo
>> I would like the connection management to live separately
>>from the mapper instances per node.
The JVM reuse option in Hadoop might be helpful for you in this case.
Amogh
On 11/16/09 6:22 AM, "yz5od2" wrote:
Hi,
a) I have a Mapper ONLY job, the job reads in records, then parses
them apart.
And, a relatively high replication factor on files to be distributed will help
:)
Amogh
On 11/16/09 9:05 AM, "Ed Kohlwey" wrote:
Hi,
What you can fit in distributed cache generally depends on the available
disk space on your nodes. With most clusters 300 mb will not be a problem,
but it depen
Hi,
This is the time for all three phases of the reducer, right?
I think it's due to the constant spilling of a single key to disk since the map
partitions couldn't be held in memory due to the buffer limit. Did the other reducer
have numerous keys with a low number of values (i.e. smaller partitions)?
Thanks
MultipleOutputFormat and MOS are to be merged :
http://issues.apache.org/jira/browse/MAPREDUCE-370
Amogh
On 11/18/09 12:03 PM, "Y G" wrote:
in the old MR API, there is the MultipleOutputFormat class which I can use to
customize the reduce output file name.
It's very useful for me,
but I can't find it i
Hi,
JobClient (.18) / Job(.20) class apis should help you achieve this.
Amogh
On 11/19/09 1:40 AM, "Gang Luo" wrote:
HI all,
I am going to execute multiple mapreduce jobs in sequence, but whether or not
to execute a job in that sequence could not be determined beforehand, but
depend on the r
Hi,
keep.tasks.files.pattern is what you need; as the name suggests, it's a pattern
match on the intermediate outputs generated.
With respect to copying map data to hdfs, your mapper's close() method should help
you achieve this, but might slow down your tasks.
Amogh
On 11/23/09 8:08 AM, "Jeff Zhang" wrote:
Hi,
I'm not sure if this will apply to your case since I'm not aware of the common
part of job2:mapper and job3:mapper, but would like to give it a shot.
The whole process can be combined into a single mapred job. The mapper will
read a record and process till the "saved data part" , then for each
Hi,
For "near" real time performance you may try Hbase. I had read about Streamy
doing this, and their hadoop-world-nyc ppt is available on their blog:
http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/
Amogh
On 11/25/09 1:31 AM, "onur ascigil" wrote:
Thanks f
Hi,
".deflate" is the default compression codec used when parameter to generate
compressed output is true ( mapred.output.compress ).
You may set the codec to be used via mapred.output.compression.codec, some
commonly used are available in hadoop.io.compress package...
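For example, a rough old-API driver sketch (MyJob is a placeholder):

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

JobConf conf = new JobConf(MyJob.class);
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class,
              CompressionCodec.class);      // gzip instead of the default deflate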
Amogh
On 11/26/09 11:03
conf.get("map.input.file") is what you need.
Amogh
On 11/26/09 12:35 PM, "Saptarshi Guha" wrote:
Hello,
I have a set of input files part-r-* which I will pass through another
map(no reduce). the part-r-* files consist of key, values, keys being
small, values fairly large(MB's)
I would like to
Configuration();
System.out.println("mapred.input.file="+cfg.get("mapred.input.file"));
displays null, so maybe this fell out by mistake in the api change?
Regards
Saptarshi
On Thu, Nov 26, 2009 at 2:13 AM, Saptarshi Guha
wrote:
> Thank you.
> Regards
> Saptarshi
>
&
Hi,
Task slots reuse the JVM over the course of the entire job, right? Specifically, I
would like to point to :
http://issues.apache.org/jira/browse/MAPREDUCE-453?focusedCommentId=12619492&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12619492
Thanks,
Amogh
On 11/30/09 5:44
Hi,
What are your intermediate & output class formats? “Text” format is
inherently UTF-8 encoded. If you want end-to-end processing to be via gbk
encoding, you may have to write a custom writable type.
Amogh
On 11/30/09 7:09 PM, "郭鹏" wrote:
> I know the default output coder is utf-8, but how
Hi,
Please try removing the combiner and running.
I know that if you use multiple outputs from within a mapper, those pairs
are not a part of the sort and shuffle phase. Your combiner is the same as your
reducer, which uses mos, and might be an issue on the map side. If I'm to take a
guess, mos writes to a diffe
Hi,
If you want to access certain jobconf parameters in your streaming script,
streaming provides this by setting localized jobconf parameters as system
environment variables, with the "." in parameters replaced by "_" .
To set jobconf parameters for streaming jobs, you can use -D <param>=<value>.
Thanks,
Amogh
Hi,
If the file doesn't exist, Java will error out.
For partial skips, the o.a.h.mapreduce.Mapper class provides a method run(), which
determines if the end of the split is reached and if not, calls map() on your
pair. You may override this method to include flag checks too, and if that
fails, the remai
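A sketch of what that override could look like (untested, new API; the stop flag is a placeholder for whatever check you need):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkippingMapper extends Mapper<LongWritable, Text, Text, Text> {

  private boolean stop = false;   // set this from map() when the flag check fails

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (!stop && context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);   // the rest of the split is simply skipped
  }
}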
Hi,
The counters are updated as the records are *consumed*, for both mapper and
reducer. Can you confirm if all the values returned by your iterators are
consumed on the reduce side? Also, do you have the feature for skipping bad records
switched on?
Amogh
On 12/11/09 4:32 AM, "Gang Luo" wrote:
I
ess than map output #.
I didn't use SkipBadRecords class. I think by default the feature is disabled.
So, it should have nothing to do with this.
I do my test using tables of TPC-DS. If I run my job on some 'toy tables' I
make, the statistics is correct.
-Gang
--
Hi,
I believe you need to add the partition file to the distributed cache so that all
tasks have it.
The terasort code uses this sampler; you can refer to that if needed.
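Roughly, in the driver (an untested old-API sketch using o.a.h.mapred.lib.InputSampler / TotalOrderPartitioner, loosely following what the terasort driver does; the path and sampling numbers are placeholders):

Path partitionFile = new Path("/tmp/myjob", "_partitions");
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.01, 1000);

TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
InputSampler.writePartitionFile(conf, sampler);             // sample, then write the file
DistributedCache.addCacheFile(partitionFile.toUri(), conf); // make it visible to every task
conf.setPartitionerClass(TotalOrderPartitioner.class);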
Amogh
On 12/15/09 5:06 PM, "afarsek" wrote:
Hi,
I'm using the InputSampler.RandomSampler to perform a partition sampling. It
Hi,
1. map.input.file in the new API is contentious. It doesn't seem to be serialized in
.20 ( https://issues.apache.org/jira/browse/HADOOP-5973 ). As of now you can
use ((FileSplit)context.getInputSplit()).getPath(); there was a post on this
sometime back.
2. For your own variables in conf, please
Hi,
Can you please let us know your system configuration running hadoop?
The error you see is when the reducer is copying its respective map output into
memory. The parameter mapred.job.shuffle.input.buffer.percent can be
manipulated for this ( a bunch of others will also help you optimize sort
Hi,
The deprecation is due to the new evolving mapreduce ( o.a.h.mapreduce ) APIs.
Old APIs are supported for available distributions. The equivalent of
TextInputFormat is available in new API :
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputForma
Hi,
You said there is no error message, so I would assume your script was shipped
and launched successfully by your perl file. Can you confirm if the error is
not encountered in your c++ code / anything else is logged on the web UI?
Also, you might want to check stream.non.zero.exit.status.is.fai
(Sorry for the spam if any, mails are bouncing back for me)
Hi,
In setup() use this,
FileSplit split = (FileSplit)context.getInputSplit();
split.getPath() will return you the Path.
Hope this helps.
Amogh
On 1/13/10 1:25 AM, "Raymond Jennings III" wrote:
Hi Gang,
I was able to use this on an
ew APIs. I was digging for that answer for awhile. Thanks.
--- On Tue, 1/12/10, Amogh Vasekar wrote:
> From: Amogh Vasekar
> Subject: Re: Is it possible to share a key across maps?
> To: "common-user@hadoop.apache.org" ,
> "raymondj...@yahoo.com" ,
> "co
Hi,
Do your "steps" qualify as separate MR jobs? Then using the JobClient APIs should
be more than sufficient for such dependencies.
You can add the whole output directory as input to another job to read all
files, and provide a PathFilter to ignore any files you don't want to be
processed, like side
Hi,
When NN is in safe mode, you get a read-only view of the hadoop file system. (
since NN is reconstructing its image of FS )
Use "hadoop dfsadmin -safemode get" to check if in safe mode.
"hadoop dfsadmin -safemode leave" to leave safe mode forcefully. Or use "hadoop
dfsadmin -safemode wait" t
our HDFS.
>>
>> -Thanks for the pointer.
>> Prasen
>>
>> On Tue, Jan 19, 2010 at 10:47 AM, Amogh Vasekar wrote:
>>> Hi,
>>> When NN is in safe mode, you get a read-only view of the hadoop file
>>> system. ( since NN is reconstructing its image
>>Can I tell hadoop to save the map outputs per reducer to be able to inspect
>>what's in them
You can set keep.tasks.files.pattern, which will save the mapper output; set this
regex to match your job/task as need be. But this will eat up a lot of local disk space.
The problem most likely is your data ( o
Hi,
To elaborate a little on Gang's point, the buffer threshold is limited by
io.sort.spill.percent, during which spills are created. If the number of spills
is more than min.num.spills.for.combine, combiner gets invoked on the spills
created before writing to disk.
I'm not sure what exactly you
Hi,
>>now that I can get the splits of a file in hadoop, is it possible to name
>>some splits (not all) as the input to mapper?
I'm assuming when you say "splits of a file in hadoop" you mean splits
generated from the inputformat and not the blocks stored in HDFS.
The [File]InputFormat you use gi
Hi,
I think the combiner gets only the key sort comparator, not the grouping
comparator. So I believe the default grouping is used for the combiner, but the
custom one for the reducer.
Here's a relevant snippet of code :
{
super(inputCounter, conf, reporter);
combinerClass = cls;
keyClass = (Class)
Hi,
For global line numbers, you would need to know the ordering within each split
generated from the input file. The standard input formats provide offsets in
splits, so if the records are of equal length you can compute some kind of
numbering.
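A sketch of that, assuming fixed-length records (the 80-byte record size, including the newline, is just an example; new-API mapper fragment):

@Override
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  final int recordLength = 80;                       // bytes per record, incl. newline
  long globalLineNumber = key.get() / recordLength;  // TextInputFormat key = byte offset
  context.write(new LongWritable(globalLineNumber), value);
}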
I remember someone had implemented sequential numb
e-parallel-program.html.
You particular solution won't work, because I need to do additional processing
between the two passes.
--gordon
On Wed, Nov 25, 2009 at 1:50 AM, Amogh Vasekar wrote:
Amogh
On 1/28/10 4:03 PM, "Ravi" wrote:
Thank you Amogh.
On Thu, Jan 28, 2010 at 3:44 PM, Am
Hi Gang,
Yes PathFilters work only on file paths. I meant you can include such type of
logic at split level.
The input format's getSplits() method is responsible for computing and adding
splits to a list container, for which JT initializes mapper tasks. You can
override the getSplits() method to
Hi,
In general, the file split may break the records; it's the responsibility of the
record reader to present the record as a whole. If you use the standard available
InputFormats, the framework will make sure complete records are presented in
<key, value> pairs.
Amogh
On 1/29/10 9:04 AM, "Udaya Lakshmi" wrote:
H
Hi,
A shot in the dark, is the conf file in your classpath? If yes, are the
parameters you are trying to override marked final?
Amogh
On 2/4/10 3:18 AM, "Gang Luo" wrote:
Hi,
I am writing a script to run a whole bunch of jobs automatically. But the
configuration file doesn't seem to be working. I thi
Hi,
In general you should not write many small files if you want the namenode to perform
well: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
To answer your question, you can write them as task side-effect files, which
will get propagated to your output directory by hadoop upon successful
completion.
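A sketch of writing one from inside a task (old API, using the task's JobConf; the file name and payload are placeholders):

Path workDir = FileOutputFormat.getWorkOutputPath(job);   // task's temporary work dir
Path sideFile = new Path(workDir, "my-side-file.dat");
FSDataOutputStream out = sideFile.getFileSystem(job).create(sideFile);
out.writeUTF("whatever needs to be persisted");
out.close();
// on successful completion the framework promotes it to the job output directory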
Hi,
When you submit a job to the cluster, you can control the behavior for blocking
/ return using JobClient's submitJob, runJob methods. It will also let you know
if the job was successful or failed, so you can design your follow up scripts
accordingly.
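For example (old-API sketch; conf is your fully configured JobConf):

RunningJob blocking = JobClient.runJob(conf);   // blocks until done, throws on failure

// or, non-blocking:
JobClient jc = new JobClient(conf);
RunningJob handle = jc.submitJob(conf);
while (!handle.isComplete()) {
  Thread.sleep(5000);                           // poll every few seconds
}
boolean ok = handle.isSuccessful();             // decide what to run next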
Amogh
On 2/17/10 11:01 AM, "jiang lic
el
--- On Wed, 2/17/10, Amogh Vasekar wrote:
From: Amogh Vasekar
Subject: Re: Hadoop automatic job status check and notification?
To: "common-user@hadoop.apache.org"
Date: Wednesday, February 17, 2010, 12:44 AM
Hi,
When you submit a job to the cluster, you can control the behavior fo
Hi,
The hadoop meet last year has some very interesting business solutions
discussed:
http://www.cloudera.com/company/press-center/hadoop-world-nyc/
Most of the companies in there have shared their methodology on their blogs /
on slideshare.
One I have handy is:
http://www.slideshare.net/hadoop/p
Hi Ankit,
>>however the issue that I am facing is that I was expecting all the maps to
>>finish before any reduce starts.
This is exactly how it happens, reducers poll map tasks for data and begin user
code only after all maps complete.
>>when is closed function called after every map or after
>> So, considering this situation of loading mixed good and corrupted ".gz"
>> files, how to still get expected results?
Try manipulating the value mapred.max.map.failures.percent to a % of files you
expect to be corrupted / acceptable data skip percent.
Amogh
On 2/21/10 7:17 AM, "jiang licht"
Hi,
Can you please let us know what platform you are running on your hadoop
machines?
For gzip and lzo to work, you need supported hadoop native libraries ( I
remember reading on this somewhere in hadoop wiki :) )
Amogh
On 2/23/10 8:16 AM, "jiang licht" wrote:
I have a pig script. If I don't
hael
--- On Mon, 2/22/10, Amogh Vasekar wrote:
From: Amogh Vasekar
Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map
output
To: "common-user@hadoop.apache.org"
Date: Monday, February 22, 2010, 11:27 PM
Hi,
Can you please let us know what platform you are ru
Hi,
Can you let us know what is the value for :
Map input records
Map spilled records
Map output bytes
Is there any side effect file written?
Thanks,
Amogh
On 2/23/10 8:57 PM, "Tim Kiefer" wrote:
No... 900GB is in the map column. Reduce adds another ~70GB of
FILE_BYTES_WRITTEN and the total co
not perform any additional file writing besides
the context.write() for the intermediate records.
Thanks, Tim
Am 24.02.2010 05:28, schrieb Amogh Vasekar:
> Hi,
> Can you let us know what is the value for :
> Map input records
> Map spilled records
> Map output bytes
> Is there any side effect file written?
>
> Thanks,
> Amogh
>