Re: Identification of mapper slots

2013-10-14 Thread Rahul Jain
I assume you know the tradeoff here: if you depend on the mapper slot # in your implementation to speed it up, you lose code portability in the long term. That said, one way to achieve this is to use the JobConf API: int partition = jobConf.getInt(JobContext.TASK_PARTITION, -1); …
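A minimal sketch of how that lookup might sit inside a mapper (new MapReduce API; the string property name mapreduce.task.partition is used here on the assumption that it matches the JobContext.TASK_PARTITION constant in your Hadoop version):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PartitionAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private int partition;

        @Override
        protected void setup(Context context) {
            // Partition (task) number of this map task; -1 if the property is unset.
            // Depending on this value ties the job to scheduler internals (the tradeoff noted above).
            partition = context.getConfiguration().getInt("mapreduce.task.partition", -1);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("partition-" + partition), value);
        }
    }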

Re: Issue: Max block location exceeded for split error when running hive

2013-09-19 Thread Rahul Jain
I am assuming you have looked at this already: https://issues.apache.org/jira/browse/MAPREDUCE-5186 You do have a workaround here to increase the mapreduce.job.max.split.locations value in the Hive configuration, or do we need more than that here? -Rahul On Thu, Sep 19, 2013 at 11:00 AM, Murtaza …
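A hedged sketch of that workaround as a plain MapReduce job setting; in Hive the same property would typically be set per session (e.g. set mapreduce.job.max.split.locations=<n>;), and the value 30 below is only an illustration:

    import org.apache.hadoop.conf.Configuration;

    public class MaxSplitLocationsExample {
        public static void main(String[] args) {
            // Workaround discussed in MAPREDUCE-5186: raise the cap on the number of
            // block locations recorded per input split (the "Max block location exceeded" check).
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.job.max.split.locations", 30); // illustrative value
            System.out.println(conf.get("mapreduce.job.max.split.locations"));
        }
    }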

Re: Issue: Max block location exceeded for split error when running hive

2013-09-19 Thread Rahul Jain
… of data, then a different day I want to run against 30 days? On Thu, Sep 19, 2013 at 3:11 PM, Rahul Jain rja...@gmail.com wrote: I am assuming you have looked at this already: https://issues.apache.org/jira/browse/MAPREDUCE-5186 You do have a workaround here to increase …

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread Rahul Jain
Which version of Hadoop are you using? MRv1 or MRv2 (YARN)? For MRv2 (YARN) you can pretty much achieve this using yarn.nodemanager.resource.memory-mb (system-wide setting) and mapreduce.map.memory.mb (job-level setting), e.g. if yarn.nodemanager.resource.memory-mb=100 and …
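To make the arithmetic concrete: each NodeManager advertises yarn.nodemanager.resource.memory-mb of memory, each map container requests mapreduce.map.memory.mb, so roughly the quotient of the two map tasks can run on a node at once. A sketch with illustrative values (not the ones from the original message):

    import org.apache.hadoop.conf.Configuration;

    public class MapConcurrencySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setInt("yarn.nodemanager.resource.memory-mb", 8192); // system-wide setting
            conf.setInt("mapreduce.map.memory.mb", 2048);             // job-level setting

            // Roughly 8192 / 2048 = 4 map containers can run concurrently on each node.
            int perNode = conf.getInt("yarn.nodemanager.resource.memory-mb", 0)
                    / conf.getInt("mapreduce.map.memory.mb", 1);
            System.out.println("Approx. concurrent map containers per node: " + perNode);
        }
    }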

Re: Time taken for launching Application Master

2013-01-20 Thread Rahul Jain
Check your node manager logs to understand the bottleneck first. When we had a similar issue on a recent version of Hadoop (which includes the fix for MAPREDUCE-4068), we rearranged our job jar file to reduce the time the node manager(s) spent 'expanding' it. -Rahul On Sun, Jan 20, 2013 …

Re: YARN Pi example job stuck at 0%(No MR tasks are started by ResourceManager)

2012-07-30 Thread Rahul Jain
The inability to look at map-reduce logs for failed jobs is due to a number of open issues in YARN; see my recent comment here: https://issues.apache.org/jira/browse/MAPREDUCE-4428?focusedCommentId=13412995&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13412995 …

Re: YARN Pi example job stuck at 0%(No MR tasks are started by ResourceManager)

2012-07-30 Thread Rahul Jain
… Rahul Jain rja...@gmail.com wrote: The inability to look at map-reduce logs for failed jobs is due to a number of open issues in YARN; see my recent comment here: https://issues.apache.org/jira/browse/MAPREDUCE-4428?focusedCommentId=13412995&page …

Re: Hive Thrift help

2012-04-16 Thread Rahul Jain
I am assuming you read through: https://cwiki.apache.org/Hive/hiveserver.html The server comes up on port 10000 by default; did you verify that it is actually listening on that port? You can also connect to the Hive server using a web browser to confirm its status. -Rahul On Mon, Apr 16, 2012 at 1:53 …
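One quick, hypothetical way to confirm the Thrift port is actually accepting connections (a netstat on the server host works just as well); the host name below is only a placeholder:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class HiveServerPortCheck {
        public static void main(String[] args) throws IOException {
            String host = args.length > 0 ? args[0] : "localhost"; // placeholder host
            try (Socket socket = new Socket()) {
                // Succeeds only if something is listening on the default HiveServer port.
                socket.connect(new InetSocketAddress(host, 10000), 5000);
                System.out.println("Port 10000 is accepting connections on " + host);
            }
        }
    }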

Re: Map Task Capacity Not Changing

2011-12-16 Thread Rahul Jain
You might be suffering from HADOOP-7822; I'd suggest you verify your pid files and fix the problem by hand if it is the same issue. -Rahul On Fri, Dec 16, 2011 at 2:40 PM, Joey Krabacher jkrabac...@gmail.com wrote: Turns out my tasktrackers (on the datanodes) are not starting properly, so I …

Re: How do I connect Java Visual VM to a remote task?

2011-10-17 Thread Rahul Jain
In our experience, the easy way to debug such problems is to use 'jmap' to take a few snapshots of one of the tasktrackers (child tasks) and analyze them under a profiler tool such as JProfiler, YourKit, etc. This should give you a pretty good indication of the objects that are using up most of the heap memory.

Re: ChainMapper and ChainReducer: Are the key/value pairs distributed to the nodes of the cluster before each Map phase?

2011-04-29 Thread Rahul Jain
Your latter statement is correct: the output of the Map1 phase (or the Reduce phase) is fed directly into the Map2 phase (or Map3 phase) within the same node, without any distribution. ChainMappers / ChainReducers are just convenience classes that allow reuse of mapper code whether executing as …
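A skeletal example of how such a chain is wired up with the old mapred API (AMapper, BMapper and MyReducer are placeholder classes, and the paths come from the command line); each mapper's output is handed straight to the next stage within the same task, and the shuffle/sort happens only once, in front of the reducer:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainExample {
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(ChainExample.class);
            job.setJobName("chain-example");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Map1: its output goes straight to Map2 in the same task, no shuffle in between.
            ChainMapper.addMapper(job, AMapper.class,
                    LongWritable.class, Text.class, Text.class, Text.class,
                    true, new JobConf(false));
            ChainMapper.addMapper(job, BMapper.class,
                    Text.class, Text.class, Text.class, Text.class,
                    true, new JobConf(false));

            // The shuffle/sort happens only once, before the single reducer in the chain.
            ChainReducer.setReducer(job, MyReducer.class,
                    Text.class, Text.class, Text.class, Text.class,
                    true, new JobConf(false));

            JobClient.runJob(job);
        }
    }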

Re: Reduce java.lang.OutOfMemoryError

2011-02-16 Thread Rahul Jain
If you Google for such memory failures, you'll find the MapReduce tunable that will help you: mapred.job.shuffle.input.buffer.percent; it is well known that the default values in the Hadoop config don't work well for large data systems. -Rahul On Wed, Feb 16, 2011 at 10:36 AM, James Seigel …
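A sketch of lowering that tunable for a single job; the default is 0.70 (fraction of the reduce task's heap used to buffer shuffled map output), and 0.20 below is only an illustrative value:

    import org.apache.hadoop.conf.Configuration;

    public class ShuffleBufferTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Fraction of the reduce task's heap used to hold shuffled map output
            // in memory before spilling to disk; lowering it leaves more headroom.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f); // illustrative
            System.out.println(conf.get("mapred.job.shuffle.input.buffer.percent"));
        }
    }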

Re: Help: How to increase amont maptasks per job ?

2011-01-07 Thread Rahul Jain
Also make sure you have enough input files for the next-stage mappers to work with... Read through the input splits part of the tutorial: http://wiki.apache.org/hadoop/HadoopMapReduce If the last stage had only 4 reducers running, they'd generate 4 output files. This will limit the # of mappers started …
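If the earlier stage's handful of output files really is the limiting factor, one option (sketched below with an illustrative reducer count and a hypothetical job name) is to raise that stage's reducer count so the next stage sees more files, and therefore more splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PreviousStageReducers {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job previousStage = new Job(conf, "previous-stage"); // hypothetical job name
            // More reducers => more output files => more input splits (and mappers)
            // available to the job that consumes this stage's output.
            previousStage.setNumReduceTasks(16); // illustrative value
        }
    }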

Re: Can MapReduce run simultaneous producer/consumer processes?

2011-01-06 Thread Rahul Jain
If the producer / consumer don't require sorting to happen, take a look at ChainMapper: http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/ChainMapper.html If you do want the processing to happen after sorting, take a look at: …

Re: How to build multiple inverted indexes?

2010-07-29 Thread Rahul Jain
Hadoop does not prevent you from writing a key/value pair multiple times in the same map iteration, if that is what your roadblock is. You can call collector.collect() multiple times with the same or distinct key/value pairs within a single map iteration. -Rahul On Thu, Jul 29, 2010 at 8:10 AM, …
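A minimal old-API mapper sketch (the "index-a" / "index-b" key prefixes are made up for illustration) emitting the same record under two different keys, i.e. collect() called twice within one map() invocation:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MultiEmitMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> collector, Reporter reporter)
                throws IOException {
            // Emit the same record into two logical indexes; collect() may be
            // called any number of times within a single map() invocation.
            collector.collect(new Text("index-a:" + key.get()), value);
            collector.collect(new Text("index-b:" + key.get()), value);
        }
    }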

Re: reading distributed cache returns null pointer

2010-07-08 Thread Rahul Jain
I am not sure why you are using the getFileClassPaths() API to access files... here is what works for us: Add the file(s) to the distributed cache using: DistributedCache.addCacheFile(p.toUri(), conf); Read the files in the mapper using: URI[] uris = DistributedCache.getCacheFiles(conf); // access one …
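Putting the two calls together in one sketch (the HDFS path is hypothetical; the DistributedCache class was later deprecated in favour of methods on Job, but matches the API quoted above):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Job setup side: add the file to the distributed cache.
            Path p = new Path("/user/hadoop/lookup.txt"); // hypothetical HDFS path
            DistributedCache.addCacheFile(p.toUri(), conf);

            // Task side (e.g. in the mapper's setup/configure): retrieve the cached URIs.
            URI[] uris = DistributedCache.getCacheFiles(conf);
            for (URI uri : uris) {
                System.out.println("cached file: " + uri);
            }
        }
    }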

Re: reading distributed cache returns null pointer

2010-07-08 Thread Rahul Jain
… (); if (hdfs.exists(my_path)) { FSDataInputStream fs = hdfs.open(my_path); while ((str = fs.readLine()) != null) System.out.println(str); } Thanks From: Rahul Jain rja …

Re: Hadoop JobTracker Hanging

2010-06-22 Thread Rahul Jain
There are two issues which were fixed in 0.21.0 and can cause the job tracker to run out of memory: https://issues.apache.org/jira/browse/MAPREDUCE-1316 and https://issues.apache.org/jira/browse/MAPREDUCE-841 We've been hit by MAPREDUCE-841 (large jobConf objects with a large number of tasks, …