Hi Raj
The easiest approach to pull out the task logs is using the JT web UI.
Go to the JT web UI and drill down on the sqoop job. You'll get a list of
failed/killed tasks; your failed task would be in there. Clicking on that task
would give you the logs for the same.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
Hi Ashish
In your hdfs-site.xml, within the <configuration> tag you need to have the
<property> tag, and inside a <property> tag you can have <name>, <value> and
<description> tags.
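For illustration, an entry in hdfs-site.xml would look like the below (the
property name and value here are just examples):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication factor.</description>
  </property>
</configuration>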
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Ashish Umrani
Date: Tue, 23 Jul 2013 09:28:00
To:
Reply-To: user@hadoop.apache.org
Job,
You need to set it on every hive session/CLI client.
This property is a job-level one and is used to indicate which pool/queue a
job should be submitted to.
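For example, at the start of every Hive CLI session you would issue something
like the below (which property applies depends on your scheduler; these are
just the common MR1 property names):

set mapred.job.queue.name=yourqueue;      -- capacity scheduler queue
set mapred.fairscheduler.pool=yourpool;   -- fair scheduler pool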
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: "Job Thomas"
Date: Thu, 30
When you run mapreduce tasks, you need CPU cycles to do the processing, not
just memory.
So ideally, based on the processor type (hyperthreaded or not), compute the
available cores. Then maybe compute it as one core for each task slot.
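As a rough illustration (numbers are only examples): on a box with 8 cores,
keeping aside ~2 cores for the OS and the DN/TT daemons leaves about 6 cores,
i.e. roughly 6 task slots in total (say 4 map + 2 reduce slots).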
Regards
Bejoy KS
Sent from remote device, Please excuse typos
Hi
I assume the question is on how many slots.
It depends on
- the child/task jvm size and the available memory.
- available number of cores
Your available memory for tasks is total memory - memory used for OS and other
services running on your box.
Other services include non-hadoop services.
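As a rough illustration (numbers are only examples): on a 24 GB node, if the OS
and the DN/TT daemons take ~4 GB, about 20 GB remains for tasks; with a 1 GB
child jvm (mapred.child.java.opts set to -Xmx1024m) that is around 20 slots,
which you would then also cap by the number of available cores.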
If you have snappy codec in io.compression.codecs then you can easily
decompress the data out of hdfs directly with a simple command.
hadoop fs -text
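For example (the path here is just a placeholder):

hadoop fs -text /user/foo/output/part-00000.snappy > /tmp/part-00000.txt

-text picks the codec based on io.compression.codecs and writes the
decompressed content to stdout, which you can redirect to a local file.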
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Jean-Marc Spaggiari
Date: Tue, 21 May 2013 12:
You are correct, map outputs are stored in LFS not in HDFS.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Ramesh R Nair
Date: Wed, 17 Apr 2013 13:06:32
To: ;
Subject: Re: Basic Doubt in Hadoop
Hi Bejoy,
Regarding the output of Map phase,
Yes, that is a valid point.
The partitioner might do a non-uniform distribution and reducers can be
unevenly loaded.
But this doesn't change the number of reducers and their distribution across
nodes. The bottom-line issue, as I understand it, is that his reduce tasks are
scheduled on just a few nodes.
Regards
Uniform data distribution across HDFS is one of the factors that ensures map
tasks are uniformly distributed across nodes. But reduce tasks don't depend
on data distribution; their scheduling is purely based on slot availability.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
Hi Rauljin
Few things to check here.
What is the number of reduce slots in each Task Tracker? What is the number of
reduce tasks for your job?
Based on the availability of slots the reduce tasks are scheduled on TTs.
You can do the following
Set the number of reduce tasks to 8 or more.
Play with the number of reduce slots per TT as well.
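As an illustration (values are examples only), you can set the per-job reducer
count on the command line if your driver uses ToolRunner:

hadoop jar yourjob.jar YourDriver -D mapred.reduce.tasks=8 <input> <output>

(or job.setNumReduceTasks(8) in the driver code). The reduce slots per TT are
controlled by mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml.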
The data is in HDFS in case of WordCount MR sample.
In hdfs, you have the metadata in NameNode and actual data as blocks replicated
across DataNodes.
In the case of the reducer, if a reducer is running on a particular node then
one replica of its output blocks is written on the same node (If there is no space
Also, you need to change the value for 'local.cache.size' in core-site.xml, not
in core-default.xml.
If you need to override any property in the config files, do it in *-site.xml,
not in *-default.xml.
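For illustration, the override in core-site.xml would look like the below (the
value is in bytes, and this figure is just an example):

<property>
  <name>local.cache.size</name>
  <value>21474836480</value>
</property>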
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: b
You can get the job.xml for each job from the JT web UI. Click on the job; on
that specific job's page you'll find it.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From:
Date: Tue, 16 Apr 2013 12:45:26
To:
Reply-To: user@hadoop.apache.org
Subject:
Hi Rahul
AFAIK there is no guarantee that 1 task would be on N1 and another on N2. Both
can be on N1 as well.
JT has no notion of JVM reuse. It doesn't consider that for task scheduling.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Rahul Bha
Hi
Any node can submit the job to the JobTracker, which distributes the jar to the
TaskTrackers, and the individual tasks are executed on nodes across the cluster.
In short, MR tasks are executed across the cluster.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Kay
Hi Brice
By adding a new storage location to dfs.data.dir you are not incrementing the
replication factor.
You are giving one more location where that DataNode can store its blocks.
There is no new DataNode added. A new DataNode would be live only if you tweak
your configs and start a new DataNode daemon.
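For illustration, a multi-directory dfs.data.dir in hdfs-site.xml would look
like the below (the paths are just placeholders):

<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>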
Hi Sai
The location you are seeing should be mapred.local.dir.
From my understanding, the files in the distributed cache would be available in
that location while the job is running and would be cleaned up at the end
of it.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
Hi Samir
Looks like there is some syntax issue with the sql query generated internally.
Can you try doing a Sqoop import by specifying the query with the --query option?
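For example, something along these lines (connection details, query and paths
are placeholders; $CONDITIONS is mandatory with --query):

sqoop import \
  --connect jdbc:mysql://dbhost/yourdb --username youruser -P \
  --query 'SELECT id, name FROM yourtable WHERE $CONDITIONS' \
  --split-by id --target-dir /user/you/yourtable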
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: samir das mohapatra
Date: Thu,
Hi Sameer
The query
"SELECT t.* FROM hgopalan.hana_training AS t WHERE 1=0"
is first executed by Sqoop to fetch the metadata.
The actual data fetch happens as part of individual queries from each map task,
each of which is a sub-query of the whole input query.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
Hi Amit,
Apart from the hadoop jars, do you have the same config files
($HADOOP_HOME/conf) that are in the cluster on your analytics server as well?
If you have the default config files on the analytics server, then your MR jobs
would be running locally and not on the cluster.
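For example (host names and ports are placeholders), the cluster configs would
have entries like the below, whereas the defaults (file:/// and local) make
jobs run in local mode:

<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:8021</value>
</property>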
Regards
Bejoy KS
Hi Chris
In larger clusters it is better to have an edge/client node where all the user
jars reside and you trigger your MR jobs from here.
A client/edge node is a server with hadoop jars and conf but hosting no daemons.
In smaller clusters one DN might act as the client node and you can execute
your jobs from there.
Hi Savitha
HA is a new feature in hadoop introduced in the Hadoop 2.x releases. So it is a
new feature on top of the Hadoop cluster.
Ganglia is one of the widely used tools to monitor the cluster in detail. On a
basic hdfs and mapreduce level, the JobTracker and NameNode web UI would give
you a good co
Hi Jamal
You can use Distributed Cache only if the file to be distributed is small.
Mapreduce should be dealing with larger datasets, so you should expect the
output file to get larger.
In a simple, straightforward manner: you can get the second data set processed
and then merge the first output with
Hi Panshul
SecondaryNameNode is better known as the checkpoint node. At periodic intervals
it merges the edit log from the NN with the fs image to prevent the edit log
from growing too large. This is its main functionality.
At any point the SNN would have the latest fs image but not the updated edit
log.
Hi Terry
When the file is unzipped and zipped, what is the number of map tasks running
in each case?
If the file is large, I assume the below should be the case.
gz is not a splittable compression codec, so the whole file would be processed
by a single mapper. And this might be causing the job to
Hi Jamal
I believe a reduce side join is what you are looking for.
You can use MultipleInputs to achieve a reduce side join for this.
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html
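A minimal driver sketch with the new (mapreduce) API is below; the
mapper/reducer class names and paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Job job = new Job(new Configuration(), "reduce-side join");
job.setJarByClass(JoinDriver.class);
// one mapper per dataset; both emit the join key as the map output key
MultipleInputs.addInputPath(job, new Path("/data/users"), TextInputFormat.class, UserMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/logs"), TextInputFormat.class, LogMapper.class);
job.setReducerClass(JoinReducer.class);   // the reducer does the actual join per key
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
job.waitForCompletion(true);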
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
Hi Panshul,
Usually, for reliability, there will be multiple dfs.name.dir locations
configured, of which one would be a remote location such as an NFS mount.
So even if the NN machine crashes as a whole, you still have the fs image
and edit log in the NFS mount. This can be utilized for reconstructing the
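For illustration, in hdfs-site.xml it would look like the below (the local dir
and the nfs mount path are just placeholders):

<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>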
Hi
To add on to Harsh's comments.
You don't have to change the task timeout.
In your map/reduce code, you can increment a counter or report status
at intervals, so that there is communication from the task and
hence it won't hit a task timeout.
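A minimal sketch with the new (mapreduce) API; the counter group/name and the
status message are just examples:

// inside your Mapper's map() or a long-running loop within it
context.progress();                                            // tell the framework the task is alive
context.getCounter("MyJob", "RecordsProcessed").increment(1);  // counter updates also count as progress
context.setStatus("still processing");                         // optional human-readable status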
Every map and reduce task run on
Hi Peter
Did you ensure that you are using SequenceFileOutputFormat from the right package?
Based on the API you are using (mapred or mapreduce), you need to use the
OutputFormat from the corresponding package.
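For example, with the new API (org.apache.hadoop.mapreduce.Job) it would be:

import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
job.setOutputFormatClass(SequenceFileOutputFormat.class);

whereas with the old API (JobConf) you would use
org.apache.hadoop.mapred.SequenceFileOutputFormat together with
conf.setOutputFormat(SequenceFileOutputFormat.class).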
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From:
Hi Chen,
You do have an option in hadoop to achieve this if you want the merged file in
LFS.
1) Run your job with n number of reducers. And you'll have n files in the
output dir.
2) Issue a hadoop fs -getmerge command to get the files in output dir merged
into a single file in LFS
(In recent
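For example (the paths are just placeholders):

hadoop fs -getmerge /user/you/job-output /tmp/merged-output.txt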
Hi Manoj
Go to the JT web UI and browse to the failed tasks. Identify which task threw
the space-related error. SSH to that node and check the disk space on that node.
Some partitions might have become 100% full.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
Hi Peter
Can you try the following in your code:
1. Make the driver class implement the Tool interface.
2. Use getConf() rather than creating a new conf instance.
DC should be working with the above mentioned modifications to the code; a
small sketch is below.
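A minimal driver sketch along those lines (class and job names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();   // reuse the conf ToolRunner populated (-files, -D etc.)
    Job job = new Job(conf, "dc example");
    job.setJarByClass(MyDriver.class);
    // ... set mapper/reducer, input/output paths ...
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}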
Sent on my BlackBerry® from Vodafone
-Original Message-