RE: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Amogh Vasekar
Regarding the timeout, I think the limiting value can be set via the mapred.task.timeout config parameter (in milliseconds). Thanks, Amogh -Original Message- From: Aaron Kimball [mailto:aa...@cloudera.com] Sent: Friday, July 10, 2009 3:02 AM To: common-user@hadoop.apache.org Subject: Re: Hadoop: Reduce exceeding 100% - a bug?

RE: Question regarding Map side Join

2009-07-13 Thread Amogh Vasekar
Yes it is. However, I assume file 2 is "comparatively" small enough to be distributed across all computing nodes without much delay, else the whole point of a map-side join is defeated. If the keys in file 2 are unique, it is a simple lookup you need to implement; else, iterate over them to implement the join.
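
A minimal sketch of the lookup-style map-side join described above, assuming the new (0.20) API and that the small file is available locally on each node (e.g. via DistributedCache); the file name "smallfile.txt" and the tab-delimited layout are illustrative only.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Joins each input record against an in-memory copy of the small file.
public class LookupJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical local file, e.g. shipped to the node via DistributedCache.
    BufferedReader in = new BufferedReader(new FileReader("smallfile.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t", 2); // key \t value, layout illustrative
      lookup.put(parts[0], parts[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    String match = lookup.get(parts[0]); // simple unique-key lookup
    if (match != null) {
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}
```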

RE: Question about job distribution

2009-07-14 Thread Amogh Vasekar
Confused. What do you mean by "query be distributed over all datanodes or just 1 node"? If your data is small enough that it fits in just one block (and is replicated by Hadoop), then just one task will be run (assuming the default input split). If the data is spread across multiple blocks, you

RE: best way to set memory

2009-07-21 Thread Amogh Vasekar
If you need to set the Java options for memory, you can do this via the job configuration in your MR job. -Original Message- From: Fernando Padilla [mailto:f...@alum.mit.edu] Sent: Wednesday, July 22, 2009 9:11 AM To: common-user@hadoop.apache.org Subject: best way to set memory So.. I want to have d
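
A sketch of the configuration route, assuming the mapred.child.java.opts parameter; the 512 MB heap value is purely illustrative.

```java
import org.apache.hadoop.mapred.JobConf;

public class MemoryConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MemoryConfigExample.class);
    // JVM options for each launched task; the heap size here is illustrative.
    conf.set("mapred.child.java.opts", "-Xmx512m");
  }
}
```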

RE: best way to set memory

2009-07-22 Thread Amogh Vasekar
each daemon-type..
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start secondarynamenode
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker
Amogh Vasekar wrote: > If you need to set the java_options for mem., you can do

RE: Output of a Reducer as a zip file?

2009-07-22 Thread Amogh Vasekar
Does MultipleOutputFormat suffice? Cheers! Amogh -Original Message- From: Mark Kerzner [mailto:markkerz...@gmail.com] Sent: Thursday, July 23, 2009 6:24 AM To: core-u...@hadoop.apache.org Subject: Output of a Reducer as a zip file? Hi, my output consists of a number of binary files, cor

RE: Why is single reducer called twice?

2009-07-27 Thread Amogh Vasekar
>> the reducer is called a second time to do nothing, before all is done
Can you elaborate please? Amogh -Original Message- From: Mark Kerzner [mailto:markkerz...@gmail.com] Sent: Monday, July 27, 2009 8:51 PM To: core-u...@hadoop.apache.org Subject: Why is single reducer called twice

RE: map side join

2009-07-31 Thread Amogh Vasekar
This is particularly useful if your input is the output of another MR job; otherwise it can be a killer. You may want to write your own mapper in case one of the files to be joined is small enough to fit in memory / can be handled in splits. Thanks, Amogh -Original Message- From: Jason Venner [mail

RE: Running 145K maps, zero reduces- does Hadoop scale?

2009-07-31 Thread Amogh Vasekar
What is the use case for this? Especially since you have 0 reducers. Thanks, Amogh -Original Message- From: Saptarshi Guha [mailto:saptarshi.g...@gmail.com] Sent: Friday, July 31, 2009 12:08 PM To: core-u...@hadoop.apache.org Subject: Re: Running 145K maps, zero reduces- does Hadoop scal

RE: setting parameters for a hadoop job

2009-08-02 Thread Amogh Vasekar
Ideally this should be done using GenericOptionsParser. Please have a look at ToolRunner for more info. Thanks, Amogh -Original Message- From: Mark Kerzner [mailto:markkerz...@gmail.com] Sent: Saturday, August 01, 2009 2:47 AM To: common-user@hadoop.apache.org Subject: setting parameters f
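
A minimal ToolRunner sketch of the suggested pattern; MyTool and my.param are placeholder names, and the job setup itself is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Anything passed as -D name=value on the command line lands in getConf().
public class MyTool extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    System.out.println("my.param = " + conf.get("my.param")); // placeholder name
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyTool(), args));
  }
}
```

Invoked as, e.g., hadoop jar myjob.jar MyTool -D my.param=value.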

RE: :!

2009-08-03 Thread Amogh Vasekar
Maybe I'm missing the point, but in terms of execution performance benefit, what does copying to dfs and then compressing to be fed to a map/reduce job provide? Isn't it better to compress "offline" / outside latency window and make available on dfs? Also, your mapreduce program will launch one

RE: Counting no. of keys.

2009-08-03 Thread Amogh Vasekar
Have you had a look at the reporter counters Hadoop provides? I think they might be helpful in your case, wherein you can locally aggregate for each map task and then push it to a global counter. -Original Message- From: Zhong Wang [mailto:wangzhong@gmail.com] Sent: Monday, August 03, 2
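
A sketch of the counter idea using the old-API Reporter the thread is discussing; the enum name is illustrative. Hadoop aggregates these per-task increments into a global, job-wide total.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  enum MyCounters { KEYS_SEEN } // illustrative counter group

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    reporter.incrCounter(MyCounters.KEYS_SEEN, 1); // local count, aggregated globally
    out.collect(value, new LongWritable(1));
  }
}
```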

RE: Some tasks fail to report status between the end of the map and the beginning of the merge

2009-08-05 Thread Amogh Vasekar
10 mins reminds me of the parameter mapred.task.timeout. This is configurable. Alternatively, you might just do a sysout to let the tracker know of the task's existence (not an ideal solution though). Thanks, Amogh -Original Message- From: Mathias De Maré [mailto:mathias.dem...@gmail.com] Sent: We
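
A sketch of both options mentioned, assuming mapred.task.timeout takes milliseconds; the 20-minute value is illustrative.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

public class TimeoutExample {
  public static void configureTimeout(JobConf conf) {
    // Raise the task timeout to 20 minutes (value in milliseconds; illustrative).
    conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);
  }

  public static void keepAlive(Reporter reporter) {
    // Cleaner than a sysout: explicitly tell the tracker the task is alive.
    reporter.progress();
  }
}
```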

RE: utilizing all cores on single-node hadoop

2009-08-17 Thread Amogh Vasekar
While setting mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, please consider the memory usage your application might have, since all tasks will be competing for the same memory and might reduce overall performance. Thanks, Amogh -Original Message- From: Harish

RE: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

2009-08-20 Thread Amogh Vasekar
AFAIK:
hadoop.tmp.dir : used by NN and DN for directory listings and metadata (don't have much info on this).
java.opts & ulimit : ulimit defines the maximum limit of virtual memory for a launched task; java.opts is the amount of memory reserved for a task. When setting these you need to account for memo

RE: passing job arguments as an xml file

2009-08-20 Thread Amogh Vasekar
Hi, GenericOptionsParser is customized only for Hadoop-specific params: "GenericOptionsParser recognizes several standard command line arguments, enabling applications to easily specify a namenode, a jobtracker, additional configuration resources etc." Ideally, all params must be passe
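
A sketch of using the parser directly, showing which arguments it consumes; nothing here is specific to any one job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParamExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Consumes -D, -fs, -jt, -files etc. into conf; returns what is left over.
    String[] appArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    System.out.println("application args remaining: " + appArgs.length);
  }
}
```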

RE: MR job scheduler

2009-08-20 Thread Amogh Vasekar
I'm not sure that is the case with Hadoop. I think it assigns a reduce task to an available tasktracker at any instant, since a reducer polls the JT for completed maps. And if it were the case as you said, a reducer wouldn't be initialized until all maps have completed, after which the copy phase would st

RE: MR job scheduler

2009-08-20 Thread Amogh Vasekar
PM To: common-user@hadoop.apache.org Subject: Re: MR job scheduler Amogh i think Reduce phase starts only when all the map phases are completed . Because it needs all the values corresponding to a particular key! 2009/8/21 Amogh Vasekar > I'm not sure that is the case with Hadoop. I t

RE: MR job scheduler

2009-08-21 Thread Amogh Vasekar
ansferring data across the network(because already many values to that key are on that machine where the map phase completed).. 2009/8/21 Amogh Vasekar > Yes, but the copy phase starts with the initialization for a reducer, after > which it would keep polling for completed map tasks to fetc

RE: How to speed up the copy phrase?

2009-08-24 Thread Amogh Vasekar
Maybe look at the mapred.reduce.parallel.copies property to speed it up... I don't see why transfer speed would be configured via params, and I think Hadoop won't be messing with that. Thanks, Amogh -Original Message- From: yang song [mailto:hadoop.ini...@gmail.com] Sent: Monday, August 24

RE: Hadoop streaming: How is data distributed from mappers to reducers?

2009-08-24 Thread Amogh Vasekar
Hadoop will make sure that every pair with the same key lands in the same reducer and is consumed in a single reduce instance. -Original Message- From: Nipun Saggar [mailto:nipun.sag...@gmail.com] Sent: Tuesday, August 25, 2009 10:41 AM To: common-user@hadoop.apache.org Subject: Re: Hadoop
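
That guarantee comes from the partitioner; the default behaves conceptually like this sketch (the real class is HashPartitioner), so identical keys always map to the same reduce partition.

```java
public class HashPartitionSketch {
  // Conceptual equivalent of Hadoop's default hash partitioning:
  // the same key always yields the same partition number.
  static int getPartition(Object key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```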

RE: difference between mapper and map runnable

2009-08-27 Thread Amogh Vasekar
Hi, Mapper is used to process the <key, value> pair passed to it; MapRunnable is an interface which, when implemented, is responsible for generating a conforming pair and passing it to the Mapper. Cheers! Amogh -Original Message- From: Rakhi Khatwani [mailto:rkhatw...@gmail.com] Sent: Thursday, August 27, 2009
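
A sketch of the relationship using the old-API MapRunnable; this mirrors what the default MapRunner does, driving the record reader and handing each pair to the mapper.

```java
import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

public class SimpleMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    // Instantiate the configured mapper, as the default MapRunner does.
    mapper = (Mapper<K1, V1, K2, V2>) ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
      Reporter reporter) throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    // Generate each conforming pair from the reader and pass it to the mapper.
    while (input.next(key, value)) {
      mapper.map(key, value, output, reporter);
    }
  }
}
```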

RE: Datanode high memory usage

2009-08-31 Thread Amogh Vasekar
This won't change the daemon configs. Hadoop by default allocates 1000MB of memory for each of its daemons, which can be controlled by HADOOP_HEAPSIZE, HADOOP_NAMENODE_OPTS, HADOOP_TASKTRACKER_OPTS in the hadoop script. However, there was a discussion on this sometime back wherein these options w

RE: Datanode high memory usage

2009-09-01 Thread Amogh Vasekar
y generated by JIRA. - You can reply to this email to add a comment to the issue online. --- Cheers! Amogh -Original Message- From: Stas Oskin [mailto:stas.os...@gmail.com] Sent: Tuesday, September 01, 2009 2:31 PM To: common-user@hadoop.apache.org Subject: Re: Datanode high me

RE: DistributedCache purgeCache()

2009-09-02 Thread Amogh Vasekar
AFAIK, releaseCache only works on cleaning the reference to your file. Try using deleteCache in a synchronized manner. Thanks, Amogh -Original Message- From: #YONG YONG CHENG# [mailto:aarnc...@pmail.ntu.edu.sg] Sent: Thursday, September 03, 2009 8:50 AM To: common-user@hadoop.apache.org Subje

RE: multi core nodes

2009-09-04 Thread Amogh Vasekar
Before setting the task limits, do take into account the memory considerations ( many archive posts on this can be found ). Also, your tasktracker and datanode daemons will run on that machine as well, so you might want to set aside some processing power for that. Cheers! Amogh -Original M

RE: Some issues!

2009-09-04 Thread Amogh Vasekar
Have a look at JobClient; it should suffice. Cheers! Amogh -Original Message- From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com] Sent: Friday, September 04, 2009 9:15 PM To: common-user@hadoop.apache.org Subject: Re: Some issues! Hey, I have one more doubt. Suppose

RE: DistributedCache purgeCache()

2009-09-07 Thread Amogh Vasekar
t: RE: DistributedCache purgeCache() Thanks for your swift response. But where can I find deletecache()? Thanks. -Original Message- From: Amogh Vasekar [mailto:am...@yahoo-inc.com] Sent: Thu 9/3/2009 2:44 PM To: common-user@hadoop.apache.org Subject: RE: DistributedCache purgeCache()

RE: Hadoop Input Files Directory

2009-09-13 Thread Amogh Vasekar
An alternative would be to use the Hadoop FS APIs to recursively list file statuses and pass those as the input files. This is slightly more complicated but will give you more control and might help while debugging as well. Just a thought. Thanks, Amogh -Original Message- From: Amandeep Khurana [ma
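
A sketch of that alternative, assuming every plain file under an HDFS directory tree should become a job input; the root path is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class RecursiveInput {
  // Walks the directory tree and adds every plain file as a job input path.
  static void addRecursively(FileSystem fs, Path dir, JobConf job) throws IOException {
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) {
        addRecursively(fs, stat.getPath(), job);
      } else {
        FileInputFormat.addInputPath(job, stat.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(RecursiveInput.class);
    FileSystem fs = FileSystem.get(job);
    addRecursively(fs, new Path("/user/data"), job); // root path illustrative
  }
}
```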

RE: How to report the status

2009-09-14 Thread Amogh Vasekar
Hi, Ran into a similar issue: https://issues.apache.org/jira/browse/HBASE-1791 Not sure if what you are experiencing is similar. Context.progress() "should" work. One ugly hack would be to set the timeout value to a high number. But I would wait for a better answer before doing that. Thanks, Amogh

JVM reuse

2009-09-15 Thread Amogh Vasekar
Hi All, Regarding the JVM reuse feature incorporated, it says reuse is generally recommended for streaming and pipes jobs. I'm a little unclear on this and any pointers will be appreciated. Also, in what scenarios will this feature be helpful for java mapred jobs? Thanks, Amogh

RE: about hadoop jvm allocation in job excution

2009-09-15 Thread Amogh Vasekar
Hi, Funny enough was looking at it just yesterday. http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Task+JVM+Reuse Thanks, Amogh -Original Message- From: Zhimin [mailto:wan...@cs.umb.edu] Sent: Tuesday, September 15, 2009 10:53 PM To: core-u...@hadoop.apache.org Subject

RE: Program crashed when volume of data getting large

2009-09-23 Thread Amogh Vasekar
Hi, Please check the namenode heap usage. Your cluster may have too many files to handle / too little free space. It is generally available in the UI. This is one of the causes I have seen for the timeout. Amogh -Original Message- From: Kunsheng Chen [mailto:ke...@yahoo.com] Sent:

RE: Best Idea to deal with following situation

2009-09-29 Thread Amogh Vasekar
Along with the partitioner, try to plug in a combiner. It would provide significant performance gains. Not sure about the algo you use, but you might have to tweak it a little to facilitate a combiner. Thanks, Amogh -Original Message- From: Chandraprakash Bhagtani [mailto:cpbhagt...@gmail.com
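
Plugging in a combiner is a one-line change on the job, as in this sketch; LongSumReducer stands in for whatever combiner-safe reducer the algorithm allows.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class CombinerSetup {
  public static void configure(JobConf job) {
    // A combiner may run zero or more times on map-side output, so it must be
    // associative and commutative -- sums and counts are the classic fit.
    job.setCombinerClass(LongSumReducer.class);
  }
}
```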

RE: Distributed cache - are files unique per job?

2009-09-29 Thread Amogh Vasekar
I believe the framework checks timestamps on HDFS for marking an already available copy of the file valid or invalid, since the archived files are not cleaned up till a certain du limit is reached, and no APIs for cleanup are available. There was a thread on this some time back on the list. Amogh

RE: Easiest way to pass dynamic variable to Map Class

2009-10-05 Thread Amogh Vasekar
Hi, I guess configure() is now setup(), and using ToolRunner you can create a configuration / context to mimic the required behavior. Thanks, Amogh -Original Message- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Tuesday, October 06, 2009 5:43 AM To: common-user@hadoop.apache.org Su

RE: How can I assign the same mapper class with different data?

2009-10-05 Thread Amogh Vasekar
Hi Huang, Haven't worked with HBase but in general, if you want control over which data split goes as a whole to a mapper, the easiest way is to compress that split into a single file, making as many split files as needed. If you need to know which file is currently being processed, you can use map.input.file.

RE: Having multiple values in Value field

2009-10-06 Thread Amogh Vasekar
>> You can always pass them as comma delimited strings
Which would be pretty expensive per record, right? Would Avro be looking into solving such problems? Amogh -Original Message- From: Jason Venner [mailto:jason.had...@gmail.com] Sent: Tuesday, October 06, 2009 11:33 AM To: common-user@hadoo

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Amogh Vasekar
For starters look at any monitoring tool like Vaidya or the Hadoop UI (Ganglia too, haven't read much on it though). Not sure if you need this for debugging purposes or for some other real-time app... You should be able to get info on the localhost of each of your map tasks in a pretty straightforward way.

Re: help with Hadoop custom Writable types implementation

2009-10-14 Thread Amogh Vasekar
Hi, AFAIK readLine is not recommended on DataInput types. Also, look into WritableUtils to see if something there may be used. Hope this helps. Amogh On 10/15/09 9:31 AM, "z3r0c001" wrote: I'm trying to implement the Writable interface, but not sure how to serialize/write/read data from nested ob

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Amogh Vasekar
g Van Nguyen Dinh" wrote: Thanks Amogh. For my application, I want each map task reports to me where it's running. However, I have no idea how to use Java Inetaddress APIs to get that info. Could you explain more? Van On Wed, Oct 14, 2009 at 2:16 PM, Amogh Vasekar wrote: > For st
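
A sketch of the InetAddress route for reporting where a task runs; how the value is emitted (log, counter group, output record) is left to the job.

```java
import java.net.InetAddress;

public class WhereAmI {
  public static void main(String[] args) throws Exception {
    InetAddress addr = InetAddress.getLocalHost();
    // Call this from setup()/map() to record the machine running the task.
    System.out.println(addr.getHostName() + " / " + addr.getHostAddress());
  }
}
```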

Re: proper way to configure classes required by mapper job

2009-10-19 Thread Amogh Vasekar
Hi, Check the DistributedCache APIs; they provide various functionalities to distribute files and add jars to the classpath on compute machines. Amogh On 10/19/09 3:38 AM, "yz5od2" wrote: Hi, What is the preferred method to distribute the classes (in various Jars) to my Hadoop instances, that are requi
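
A sketch using the DistributedCache classpath helper, assuming the jar has already been copied to HDFS; the path is illustrative.

```java
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheJarSetup {
  public static void configure(JobConf job) throws Exception {
    // Ships the jar to each node and adds it to the task classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), job); // path illustrative
  }
}
```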

Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge

2009-10-19 Thread Amogh Vasekar
Hi, It would be more helpful if you provide the exact error here. Also, hadoop uses the local FS to store intermediate data, along with HDFS for final output. If your job is memory intensive, try limiting the number of tasks you are running in parallel on a machine. Amogh On 10/19/09 8:27 AM,

Re: How to skip fail map to done the job

2009-10-20 Thread Amogh Vasekar
For skipping failed tasks try: mapred.max.map.failures.percent. Amogh On 10/21/09 8:58 AM, "梁景明" wrote: hi, I use hadoop 0.20 and 8 nodes. There is a job that has 130 maps to run; it completed 128 maps, but 2 maps fail, and their failure in my case is acceptable, yet the job fails; the last 128 map a

Re: Can I have multiple reducers?

2009-10-23 Thread Amogh Vasekar
Hi, On what parameters does the output key of your (first) reducer depend? Amogh On 10/23/09 8:24 AM, "Aaron Kimball" wrote: If you need another shuffle after your first reduce pass, then you need a second MapReduce job to run after the first one. Just use an IdentityMapper. This is a reasonab

Re: How To Pass Parameters To Mapper Through Main Method

2009-10-25 Thread Amogh Vasekar
Hi, Many options available here. You can use jobconf (0.18) / context.conf (0.20) to pass these lines across all tasks (assuming the size isn't relatively large) and use configure / setup to retrieve these. Or use the distributed cache to read a file containing these lines (possibly with JVM reuse
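
A sketch of the conf-based option in the 0.20 API; the property name my.lines is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamPassing {
  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String[] lines;

    @Override
    protected void setup(Context context) {
      // Retrieve what the driver stored; property name illustrative.
      lines = context.getConfiguration().getStrings("my.lines");
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.setStrings("my.lines", "line one", "line two"); // driver side
    Job job = new Job(conf, "param-passing");
    job.setMapperClass(MyMapper.class);
  }
}
```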

Re: Does the map task push map output to reduce task or reduce task pull it from map task

2009-10-26 Thread Amogh Vasekar
Hi, The reduce task looks at map tasks for the partition it requires, and pulls it (the number of parallel copies is controlled by mapred.reduce.parallel.copies). As partitions are taken in by the reduce task, it performs a merge sort; this forms your sort & shuffle phase. Typically your mappers / reducers are O(n),

Re: Problem to create sequence file for

2009-10-27 Thread Amogh Vasekar
Hi Bhushan, If splitting input files is an option, why don't you let Hadoop do this for you? If need be you may use a custom input format and a SequenceFile*OutputFormat. Amogh On 10/27/09 7:55 PM, "bhushan_mahale" wrote: Hi Jason, Thanks for the reply. The string is the entire content of the

Re: Distribution of data in nodes with different storage capacity

2009-10-28 Thread Amogh Vasekar
Hi, Rebalancer should help you : http://issues.apache.org/jira/browse/HADOOP-1652 Amogh On 10/28/09 2:54 PM, "Vibhooti Verma" wrote: Hi All, We are facing the issue with distribution of data in a cluster where nodes have differnt storage capacity. We have 4 nodes with 100G capacity and 1 node w

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Amogh Vasekar
Hi, Quick questions... Are you creating too many small files? Are there any task side-effect files being created? Does the heap for the NN have enough space to list metadata? Any details on its general health will probably be helpful to people on the list. Amogh On 11/2/09 2:02 PM, "Zhang Bingjun (Eddy)

Re: Multiple Input Paths

2009-11-02 Thread Amogh Vasekar
Mark, Set-up for a mapred job consumes a considerable amount of time and resources, so if possible a single job is preferred. You can add multiple paths to your job, and if you need different processing logic depending upon the input being consumed, you can use the parameter map.input.file in yo
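
A sketch of a single job reading multiple paths and branching on map.input.file inside the mapper (old API); the paths and the /logs/ rule are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MultiPathMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private boolean isLogs;

  @Override
  public void configure(JobConf job) {
    // map.input.file holds the file the current split came from.
    isLogs = job.get("map.input.file", "").contains("/logs/"); // rule illustrative
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    if (isLogs) {
      // ...processing logic for the /logs/ input...
    } else {
      // ...processing logic for the other input...
    }
  }

  public static void setInputs(JobConf job) {
    FileInputFormat.addInputPath(job, new Path("/data/logs"));  // illustrative
    FileInputFormat.addInputPath(job, new Path("/data/users")); // illustrative
  }
}
```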

Re: Multiple Input Paths

2009-11-03 Thread Amogh Vasekar
Hi Mark, A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs. This would allow you to have a different inputformat, mapper depending on the path you are getting the split from. It uses special Delegating[mapper/input] classes to resolve this. I understand backpor
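
Where the class is available, the old-API version is wired up like this sketch; the paths are illustrative and IdentityMapper stands in for the per-path mappers.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultiInputSetup {
  public static void configure(JobConf job) {
    // Each path gets its own input format and mapper.
    MultipleInputs.addInputPath(job, new Path("/data/a"), // paths illustrative
        TextInputFormat.class, IdentityMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/b"),
        KeyValueTextInputFormat.class, IdentityMapper.class);
  }
}
```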

Re: DFS block size

2009-11-14 Thread Amogh Vasekar
Replies inline. On 11/14/09 9:55 PM, "Hrishikesh Agashe" wrote: Hi, Default DFS block size is 64 MB. Does this mean that if I put a file less than 64 MB on HDFS, it will not be divided any further? --Yes, the file will be stored in a single block per replica. I have lots and lots of XMLs and I wo

Re: architecture help

2009-11-15 Thread Amogh Vasekar
>> I would like the connection management to live separately from the mapper instances per node.
The JVM reuse option in Hadoop might be helpful for you in this case. Amogh On 11/16/09 6:22 AM, "yz5od2" wrote: Hi, a) I have a Mapper ONLY job, the job reads in records, then parses them apart.

Re: About Distribute Cache

2009-11-15 Thread Amogh Vasekar
And, a relatively high replication factor on files to be distributed will help :) Amogh On 11/16/09 9:05 AM, "Ed Kohlwey" wrote: Hi, What you can fit in distributed cache generally depends on the available disk space on your nodes. With most clusters 300 mb will not be a problem, but it depen

Re: How to handle imbalanced data in hadoop ?

2009-11-18 Thread Amogh Vasekar
Hi, This is the time for all three phases of the reducer, right? I think it's due to the constant spilling for a single key to disk, since the map partitions couldn't be held in memory due to the buffer limit. Did the other reducer have numerous keys with a low number of values (i.e. smaller partitions)? Thanks

Re: new MR API:MutilOutputFormat

2009-11-18 Thread Amogh Vasekar
MultipleOutputFormat and MOS are to be merged: http://issues.apache.org/jira/browse/MAPREDUCE-370 Amogh On 11/18/09 12:03 PM, "Y G" wrote: in the old MR API, there is the MultipleOutputFormat class which I can use to customize the reduce output file name. It's very useful for me, but I can't find it i

Re: execute multiple MR jobs

2009-11-18 Thread Amogh Vasekar
Hi, The JobClient (.18) / Job (.20) class APIs should help you achieve this. Amogh On 11/19/09 1:40 AM, "Gang Luo" wrote: Hi all, I am going to execute multiple mapreduce jobs in sequence, but whether or not to execute a job in that sequence cannot be determined beforehand; it depends on the r
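
A sketch of conditional chaining with the old-API JobClient; job1 and job2 are placeholder configurations built elsewhere.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ChainedJobs {
  public static void run(JobConf job1, JobConf job2) throws Exception {
    JobClient client = new JobClient(job1);
    RunningJob first = client.submitJob(job1); // returns immediately
    first.waitForCompletion();                 // block until the job finishes
    if (first.isSuccessful()) {
      JobClient.runJob(job2); // decide at runtime whether the next job runs
    }
  }
}
```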

Re: Saving Intermediate Results from the Mapper

2009-11-22 Thread Amogh Vasekar
Hi, keep.task.files.pattern is what you need; as the name suggests, it is a pattern match on the intermediate outputs generated. As for copying map data to HDFS, your mapper's close() method should help you achieve this, but it might slow up your tasks. Amogh On 11/23/09 8:08 AM, "Jeff Zhang" wrote:

Re: Saving Intermediate Results from the Mapper

2009-11-24 Thread Amogh Vasekar
Hi, I'm not sure if this will apply to your case since I'm not aware of the common part of job2:mapper and job3:mapper, but would like to give it a shot. The whole process can be combined into a single mapred job. The mapper will read a record and process till the "saved data part", then for each

Re: Hadoop Performance

2009-11-24 Thread Amogh Vasekar
Hi, For "near" real time performance you may try Hbase. I had read about Streamy doing this, and their hadoop-world-nyc ppt is available on their blog: http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/ Amogh On 11/25/09 1:31 AM, "onur ascigil" wrote: Thanks f

Re: part-00000.deflate as output

2009-11-25 Thread Amogh Vasekar
Hi, ".deflate" is the default compression codec used when parameter to generate compressed output is true ( mapred.output.compress ). You may set the codec to be used via mapred.output.compression.codec, some commonly used are available in hadoop.io.compress package... Amogh On 11/26/09 11:03

Re: The name of the current input file during a map

2009-11-25 Thread Amogh Vasekar
conf.get("map.input.file") is what you need. Amogh On 11/26/09 12:35 PM, "Saptarshi Guha" wrote: Hello, I have a set of input files part-r-* which I will pass through another map (no reduce). The part-r-* files consist of keys and values, keys being small, values fairly large (MBs). I would like to

Re: The name of the current input file during a map

2009-11-26 Thread Amogh Vasekar
Configuration(); System.out.println("mapred.input.file="+cfg.get("mapred.input.file")); displays null, so maybe this fell out by mistake in the api change? Regards Saptarshi On Thu, Nov 26, 2009 at 2:13 AM, Saptarshi Guha wrote: > Thank you. > Regards > Saptarshi > &

Re: Problem with mapred.job.reuse.jvm.num.tasks

2009-11-30 Thread Amogh Vasekar
Hi, Task slots reuse the JVM over the course of the entire job, right? Specifically, I would like to point to: http://issues.apache.org/jira/browse/MAPREDUCE-453?focusedCommentId=12619492&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12619492 Thanks, Amogh On 11/30/09 5:44

Re: How can I change the mapreduce output coder?

2009-12-01 Thread Amogh Vasekar
Hi, What are your intermediate & output class formats? “Text” format is inherently UTF-8 encoded. If you want end-to-end processing to be via gbk encoding, you may have to write a custom writable type. Amogh On 11/30/09 7:09 PM, "郭鹏" wrote: > I know the default output coder is utf-8, but how

Re: Hadoop with Multiple Inpus and Outputs

2009-12-03 Thread Amogh Vasekar
Hi, Please try removing the combiner and running. I know that if you use multiple outputs from within a mapper, those pairs are not part of the sort and shuffle phase. Your combiner is the same as the reducer which uses mos, and that might be an issue on the map side. If I'm to take a guess, mos writes to a diffe

Re: only one reduce task?

2009-12-03 Thread Amogh Vasekar
Hi, If you want to access certain jobconf parameters in your streaming script, streaming provides this by setting localized jobconf parameters as system environment variables, with the "." in parameter names replaced by "_". To set jobconf parameters for streaming jobs, you can use -D name=value. Thanks, Amo

Re: Re: return in map

2009-12-06 Thread Amogh Vasekar
Hi, If the file doesn't exist, Java will error out. For partial skips, the o.a.h.mapreduce.Mapper class provides a method run(), which determines if the end of the split is reached and, if not, calls map() on your pair. You may override this method to include flag checks too, and if a check fails, the remai
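
A sketch of that run() override in the 0.20 API; the stop flag and whatever sets it are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkippableMapper extends Mapper<LongWritable, Text, Text, Text> {
  private volatile boolean stop = false; // illustrative flag, set on some condition

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    // Same loop as the default run(), plus an early-exit check to skip
    // the remainder of the split once the flag is raised.
    while (!stop && context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}
```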

Re: Re: Re: Re: map output not euqal t o reduce input

2009-12-10 Thread Amogh Vasekar
Hi, The counters are updated as the records are *consumed*, for both mapper and reducer. Can you confirm that all the values returned by your iterators are consumed on the reduce side? Also, do you have the feature of skipping bad records switched on? Amogh On 12/11/09 4:32 AM, "Gang Luo" wrote: I

Re: Re: Re: Re: Re: map output not euqal to reduce input

2009-12-14 Thread Amogh Vasekar
ess than map output #. I didn't use SkipBadRecords class. I think by default the feature is disabled. So, it should have nothing to do with this. I do my test using tables of TPC-DS. If I run my job on some 'toy tables' I make, the statistics is correct. -Gang --

Re: File _partition.lst does not exist.

2009-12-15 Thread Amogh Vasekar
Hi, I believe you need to add the partition file to the distributed cache so that all tasks have it. The terasort code uses this sampler; you can refer to that if needed. Amogh On 12/15/09 5:06 PM, "afarsek" wrote: Hi, I'm using the InputSampler.RandomSampler to perform partition sampling. It
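
A sketch of wiring up a sampled partition file, loosely following what the terasort code does (old API); the file location and sampling rates are illustrative, and the job's input paths must already be set before sampling.

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class SampledPartitioning {
  public static void configure(JobConf job) throws Exception {
    Path partitionFile = new Path("/tmp/_partition.lst"); // location illustrative
    TotalOrderPartitioner.setPartitionFile(job, partitionFile);
    // Sample the job's input and write the split points to the partition file.
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000)); // rates illustrative
    // Ship the partition file so every task can read it locally.
    DistributedCache.addCacheFile(new URI(partitionFile.toString()), job);
    job.setPartitionerClass(TotalOrderPartitioner.class);
  }
}
```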

Re: Configuration.set/Configuration.get now working

2010-01-05 Thread Amogh Vasekar
Hi, 1. map.input.file in the new API is contentious. It doesn't seem to be serialized in .20 ( https://issues.apache.org/jira/browse/HADOOP-5973 ). As of now you can use ((FileSplit)context.getInputSplit()).getPath(); there was a post on this sometime back. 2. For your own variables in conf, please

Re: What can cause: Map output copy failure

2010-01-08 Thread Amogh Vasekar
Hi, Can you please let us know your system configuration running hadoop? The error you see is when the reducer is copying its respective map output into memory. The parameter mapred.job.shuffle.input.buffer.percent can be manipulated for this ( a bunch of others will also help you optimize sort

Re: isSplitable() deprecated

2010-01-08 Thread Amogh Vasekar
Hi, The deprecation is due to the new evolving mapreduce ( o.a.h.mapreduce ) APIs. Old APIs are supported for available distributions. The equivalent of TextInputFormat is available in new API : http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputForma

Re: How can I run a executable file from streaming.

2010-01-10 Thread Amogh Vasekar
Hi, You said there is no error message, so I would assume your script was shipped and launched successfully by your Perl file. Can you confirm that the error is not encountered in your C++ code / anything else is logged on the web UI? Also, you might want to check stream.non.zero.exit.status.is.fai

Re: Is it possible to share a key across maps?

2010-01-12 Thread Amogh Vasekar
(Sorry for the spam if any, mails are bouncing back for me) Hi, In setup() use this, FileSplit split = (FileSplit)context.getInputSplit(); split.getPath() will return you the Path. Hope this helps. Amogh On 1/13/10 1:25 AM, "Raymond Jennings III" wrote: Hi Gang, I was able to use this on an

Re: Is it possible to share a key across maps?

2010-01-13 Thread Amogh Vasekar
ew APIs. I was digging for that answer for awhile. Thanks. --- On Tue, 1/12/10, Amogh Vasekar wrote: > From: Amogh Vasekar > Subject: Re: Is it possible to share a key across maps? > To: "common-user@hadoop.apache.org" , > "raymondj...@yahoo.com" , > "co

Re: Is it always called part-00000?

2010-01-18 Thread Amogh Vasekar
Hi, Do your "steps" qualify as separate MR jobs? Then using JobClient APIs should be more than sufficient for such dependencies. You can add the whole output directory as input to another one to read all files, and provide PathFilter to ignore any files you don't want to be processed, like side

Re: rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-18 Thread Amogh Vasekar
Hi, When NN is in safe mode, you get a read-only view of the hadoop file system. ( since NN is reconstructing its image of FS ) Use "hadoop dfsadmin -safemode get" to check if in safe mode. "hadoop dfsadmin -safemode leave" to leave safe mode forcefully. Or use "hadoop dfsadmin -safemode wait" t

Re: rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-19 Thread Amogh Vasekar
our HDFS. >> >> -Thanks for the pointer. >> Prasen >> >> On Tue, Jan 19, 2010 at 10:47 AM, Amogh Vasekar wrote: >>> Hi, >>> When NN is in safe mode, you get a read-only view of the hadoop file >>> system. ( since NN is reconstructing its image

Re: Debugging Partitioner problems

2010-01-20 Thread Amogh Vasekar
>> Can I tell hadoop to save the map outputs per reducer to be able to inspect what's in them
You can set keep.task.files.pattern, which will save mapper output; set this regex to match your job/task as need be. But this will eat up a lot of local disk space. The problem most likely is your data ( o

Re: When exactly is combiner invoked?

2010-01-27 Thread Amogh Vasekar
Hi, To elaborate a little on Gang's point, the buffer threshold is limited by io.sort.spill.percent, during which spills are created. If the number of spills is more than min.num.spills.for.combine, the combiner gets invoked on the spills created before writing to disk. I'm not sure what exactly you

Re: fine granularity operation on HDFS

2010-01-27 Thread Amogh Vasekar
Hi,
>> now that I can get the splits of a file in hadoop, is it possible to name some splits (not all) as the input to mapper?
I'm assuming when you say "splits of a file in hadoop" you mean splits generated from the InputFormat and not the blocks stored in HDFS. The [File]InputFormat you use gi

Re: Question on GroupingComparatorClass

2010-01-27 Thread Amogh Vasekar
Hi, I think the combiner gets only the key's sort comparator, not the grouping comparator. So I believe the default grouping is used for the combiner, but the custom one for the reducer. Here's a relevant snippet of code: { super(inputCounter, conf, reporter); combinerClass = cls; keyClass = (Class)

Re: Input file format doubt

2010-01-28 Thread Amogh Vasekar
Hi, For global line numbers, you would need to know the ordering within each split generated from the input file. The standard input formats provide offsets in splits, so if the records are of equal length you can compute some kind of numbering. I remember someone had implemented sequential numb
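
For the equal-length-record case, the line number falls straight out of the byte offset, as in this sketch; RECORD_LEN is an assumed fixed record length including the newline.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineNumberMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private static final long RECORD_LEN = 80; // assumed fixed length, incl. newline

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    // TextInputFormat keys are byte offsets into the file, so for fixed-length
    // records the global line number is simply offset / record length.
    context.write(new LongWritable(offset.get() / RECORD_LEN), value);
  }
}
```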

Re: Input file format doubt

2010-01-28 Thread Amogh Vasekar
e-parallel-program.html. You particular solution won't work, because I need to do additional processing between the two passes. --gordon On Wed, Nov 25, 2009 at 1:50 AM, Amogh Vasekar wrote: Amogh On 1/28/10 4:03 PM, "Ravi" wrote: Thank you Amogh. On Thu, Jan 28, 2010 at 3:44 PM, Am

Re: fine granularity operation on HDFS

2010-01-28 Thread Amogh Vasekar
Hi Gang, Yes, PathFilters work only on file paths. I meant you can include such logic at the split level. The input format's getSplits() method is responsible for computing and adding splits to a list container, from which the JT initializes mapper tasks. You can override the getSplits() method to
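
A sketch of that override, filtering the splits computed by the parent class (old API); the keep-every-other rule is purely illustrative.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class FilteringInputFormat extends TextInputFormat {
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] all = super.getSplits(job, numSplits);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (int i = 0; i < all.length; i++) {
      if (i % 2 == 0) { // illustrative filter: keep every other split
        kept.add(all[i]);
      }
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}
```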

Re: File split query

2010-01-28 Thread Amogh Vasekar
Hi, In general, a file split may break records; it is the responsibility of the record reader to present each record as a whole. If you use the standard available InputFormats, the framework will make sure complete records are presented in each <key, value> pair. Amogh On 1/29/10 9:04 AM, "Udaya Lakshmi" wrote: H

Re: configuration file

2010-02-04 Thread Amogh Vasekar
Hi, A shot in the dark: is the conf file in your classpath? If yes, are the parameters you are trying to override marked final? Amogh On 2/4/10 3:18 AM, "Gang Luo" wrote: Hi, I am writing a script to run a whole bunch of jobs automatically. But the configuration file doesn't seem to be working. I thi

Re: Is it possible to write each key-value pair emitted by the reducer to a different output file

2010-02-04 Thread Amogh Vasekar
Hi, In general you should not write many small files, for the namenode to perform well: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ To answer your question, you can write them as task side-effect files, which will get propagated to your output directory by Hadoop upon successful completion.

Re: Hadoop automatic job status check and notification?

2010-02-16 Thread Amogh Vasekar
Hi, When you submit a job to the cluster, you can control the blocking / return behavior using JobClient's submitJob and runJob methods. They will also let you know if the job was successful or failed, so you can design your follow-up scripts accordingly. Amogh On 2/17/10 11:01 AM, "jiang lic

Re: Hadoop automatic job status check and notification?

2010-02-17 Thread Amogh Vasekar
el --- On Wed, 2/17/10, Amogh Vasekar wrote: From: Amogh Vasekar Subject: Re: Hadoop automatic job status check and notification? To: "common-user@hadoop.apache.org" Date: Wednesday, February 17, 2010, 12:44 AM Hi, When you submit a job to the cluster, you can control the behavior fo

Re: basic hadoop job help

2010-02-18 Thread Amogh Vasekar
Hi, The Hadoop meet last year had some very interesting business solutions discussed: http://www.cloudera.com/company/press-center/hadoop-world-nyc/ Most of the companies there have shared their methodology on their blogs / on SlideShare. One I have handy is: http://www.slideshare.net/hadoop/p

Re: Pass the TaskId from map to Reduce

2010-02-18 Thread Amogh Vasekar
Hi Ankit,
>> however the issue that I am facing is that I was expecting all the maps to finish before any reduce starts.
This is exactly how it happens: reducers poll map tasks for data and begin user code only after all maps complete.
>> when is the close() function called, after every map or after

Re: Unexpected empty result problem (zero-sized part-### files)?

2010-02-21 Thread Amogh Vasekar
>> So, considering this situation of loading mixed good and corrupted ".gz" files, how to still get expected results?
Try manipulating the value of mapred.max.map.failures.percent to the % of files you expect to be corrupted / an acceptable data-skip percent. Amogh On 2/21/10 7:17 AM, "jiang licht"

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread Amogh Vasekar
Hi, Can you please let us know what platform you are running on your Hadoop machines? For gzip and LZO to work, you need the supported Hadoop native libraries (I remember reading about this somewhere in the Hadoop wiki :) ) Amogh On 2/23/10 8:16 AM, "jiang licht" wrote: I have a pig script. If I don't

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread Amogh Vasekar
hael --- On Mon, 2/22/10, Amogh Vasekar wrote: From: Amogh Vasekar Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output To: "common-user@hadoop.apache.org" Date: Monday, February 22, 2010, 11:27 PM Hi, Can you please let us know what platform you are ru

Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Amogh Vasekar
Hi, Can you let us know what the values are for:
Map input records
Map spilled records
Map output bytes
Is there any side-effect file written? Thanks, Amogh On 2/23/10 8:57 PM, "Tim Kiefer" wrote: No... 900GB is in the map column. Reduce adds another ~70GB of FILE_BYTES_WRITTEN and the total co

Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-24 Thread Amogh Vasekar
not perform any additional file writing besides the context.write() for the intermediate records. Thanks, Tim Am 24.02.2010 05:28, schrieb Amogh Vasekar: > Hi, > Can you let us know what is the value for : > Map input records > Map spilled records > Map output bytes > Is there any side effect file written? > > Thanks, > Amogh >
