Dear Users,
I configured Eclipse Europa according to the Yahoo tutorial on Hadoop:
http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html
and at the step about creating a new DFS Location the instructions say:
“…Next, click on the “Advanced” tab. There are two settings here which
must be
I ran an I/O test with the M/R framework. Each mapper writes a 200 MB file to HDFS.
I print the bytesRead and bytesWritten values of FileSystem.Statistics every 1000 ms.
But these two values do not update immediately as the M/R job progresses.
Does anybody know the reason?
Thanks.
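For anyone who wants to poll the same counters, here is a minimal sketch of the loop described above (hedged: getAllStatistics() is the accessor in newer Hadoop releases; older versions expose the same counters through a different static method, and the class name is just for illustration):

import java.util.List;
import org.apache.hadoop.fs.FileSystem;

public class StatsPoller implements Runnable {
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        List<FileSystem.Statistics> all = FileSystem.getAllStatistics();
        for (FileSystem.Statistics s : all) {
          // Counters are kept per FileSystem; print the two values discussed above.
          System.out.println("read=" + s.getBytesRead()
              + " written=" + s.getBytesWritten());
        }
        Thread.sleep(1000); // the 1000 ms interval mentioned above
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}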
--
http://www.cloudera.com/hadoop-training-hive-introduction
http://www.cloudera.com/hadoop-training-pig-introduction
On Wed, May 6, 2009 at 1:17 AM, Ricky Ho r...@adobe.com wrote:
Are they competing technologies for providing a higher-level language for
Map/Reduce programming?
Or are they
George,
In my Eclipse Europa it shows the attribute
hadoop.job.ugi. It appears after fs.trash.interval.
Thanks &amp; Regards
Aseem Puri
-Original Message-
From: George Pang [mailto:p09...@gmail.com]
Sent: Wednesday, May 06, 2009 1:07 PM
To: core-user@hadoop.apache.org;
Hi David,
The MapReduce framework will attempt to rerun failed tasks
automatically. However, if a task is running out of memory on one
machine, it's likely to run out of memory on another, isn't it? Have a
look at the mapred.child.java.opts configuration property for the
amount of memory that
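For reference, a minimal sketch of where that property is set from a job driver (old mapred API; the class name and the -Xmx value are only examples, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Each map/reduce child JVM gets these options; the value is just an example.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    System.out.println(conf.get("mapred.child.java.opts"));
  }
}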
Hi all,
This is Grace.
I am replacing the Sun JVM with the JRockit JVM for Hadoop, keeping all the same
Java options and configuration as with the Sun JVM. However, it is very strange that
performance with the JRockit JVM is poorer than with Sun; for example,
the map stage became slower.
Has anyone
Hi Tom,
Thanks for this. I'll follow that up and see how I get on. At issue is the
frequency of the data I have streaming in. Even if I create a new file with
a name based on milliseconds I'm still running into the same problems. My
thought is that using append, although it's not production
Hi.
Yes, this was probably it.
The strangest part is that HDFS somehow worked even with all the files in the
NN directory being empty.
Go figure...
Regards.
2009/5/5 Raghu Angadi rang...@yahoo-inc.com
the image is stored in two files: fsimage and edits
(under namenode-directory/current/).
Tom White wrote:
Hi David,
The MapReduce framework will attempt to rerun failed tasks
automatically. However, if a task is running out of memory on one
machine, it's likely to run out of memory on another, isn't it? Have a
look at the mapred.child.java.opts configuration property for the
amount
Hi, I have a couple of small issues regarding Hadoop/HBase:
1. I want to scan a table, but the table is really huge, so I would like to send
the result of the scan to a file that I can then analyze. How do we go about it?
(A rough client-side sketch follows below.)
2. How do you dynamically add and remove nodes in the cluster without
disturbing the
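For question 1, a rough sketch of a client-side dump (this assumes a newer HTable/Scan/ResultScanner client API than 0.19-era HBase may have shipped; the table name and output file are placeholders, and for a really huge table a MapReduce job over the table is usually a better fit than a single client):

import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class DumpTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");            // placeholder table name
    ResultScanner scanner = table.getScanner(new Scan());  // full-table scan
    PrintWriter out = new PrintWriter("scan-dump.txt");    // placeholder output file
    try {
      for (Result row : scanner) {
        // Result.toString() is enough for eyeballing; parse cells for real analysis.
        out.println(row);
      }
    } finally {
      out.close();
      scanner.close();
      table.close();
    }
  }
}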
Greetings to all,
Could anyone tell me whether Paths from different FileSystems can be used as
input to a Hadoop job?
In particular, I'd like to find out whether Paths from HarFileSystem can be
mixed with ones from DistributedFileSystem.
Thanks,
--
Kind regards,
Ivan
Hi Ivan,
I haven't tried this combination, but I think it should work. If it
doesn't, it should be treated as a bug.
Tom
On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net wrote:
Greetings to all,
Could anyone suggest if Paths from different FileSystems can be used as
input
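A hedged sketch of what mixing the two might look like in a job driver (old mapred API; both URIs below are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MixedInputs {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MixedInputs.class);
    // One input from a Hadoop archive, one from HDFS; both URIs are placeholders.
    FileInputFormat.addInputPath(conf, new Path("har:///user/ivan/archive.har/data"));
    FileInputFormat.addInputPath(conf, new Path("hdfs://namenode:9000/user/ivan/data"));
    // ... set mapper, reducer and output path, then submit with JobClient.runJob(conf)
  }
}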
See the core-user mail thread with the subject "HBase, Hive, Pig and other Hadoop
based technologies".
- Sharad
Ricky Ho wrote:
Are they competing technologies for providing a higher-level language for
Map/Reduce programming?
Or are they complementary?
Any comparison between them?
Rgds,
Hi Rajarshi,
FileInputFormat (SDFInputFormat's superclass) will break files into
splits, typically on HDFS block boundaries (if the defaults are left
unchanged). This is not a problem for your code however, since it will
read every record that starts within a split (even if it crosses a
split
The split doesn't need to be at the record boundary. If a mapper gets
a partial record, it will seek to another split to get the full record.
- Sharad
Hi,
Are we supposed to make changes in OutputFormat? If so, how do we go about it,
since it is an interface?
If someone has solved this problem, can you kindly mention the steps
necessary for the same?
Thanks
Devika Aruna
-Original Message-
From: Sharad Agarwal
On May 6, 2009, at 8:22 AM, Tom White wrote:
Hi Rajarshi,
FileInputFormat (SDFInputFormat's superclass) will break files into
splits, typically on HDFS block boundaries (if the defaults are left
unchanged). This is not a problem for your code however, since it will
read every record that
Hey Tom, I had no luck using the StreamingXmlRecordReader for non-XML files.
Are there any parameters that you need to add in? I was testing with 0.19.0.
On Wed, May 6, 2009 at 5:25 AM, Sharad Agarwal shara...@yahoo-inc.comwrote:
The split doesn't need to be at the record boundary. If a mapper
Or, is there a way to find out who the author of that Yahoo tutorial on
Hadoop / Eclipse is?
Thanks
George
2009/5/6 George Pang p09...@gmail.com
Dear Users,
I configured Eclipse Europa according to the Yahoo tutorial on Hadoop:
http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html
Hello,
After examining the libhdfs library, I cannot find any support for compression
- is this correct?
And, if this is the case, is it also correct that it would be almost trivial to
implement in hdfsOpenFile() by making an additional call to one of the
compression codecs' createInputStream() /
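For comparison, the Java-side pattern that such a change would mirror looks roughly like this (the path is a placeholder; CompressionCodecFactory picks the codec from the file extension):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class OpenCompressed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/user/example/data.gz"); // placeholder path
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
    InputStream in = (codec == null)
        ? fs.open(p)                            // unknown extension: read raw bytes
        : codec.createInputStream(fs.open(p));  // wrap the raw stream with the codec
    in.close();
  }
}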
Hello,
You can implement the OutputFormat interface and write your own. You can look at
the code of TextOutputFormat, MultipleOutputFormat, etc. for reference. It
might be the case that you only need to make minor changes to one of the
existing OutputFormat classes. To do that you can just subclass that
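As a purely illustrative sketch of that subclassing approach (old mapred API; this variant routes each record to a file named after its key instead of the usual part-NNNNN files):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // Name the output file after the record's key rather than the default part file.
    return key.toString();
  }
}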
Today I formatted the namenode while the namenode and jobtracker were
up. I found that I was still able to browse the file system using the
command: bin/hadoop dfs -lsr /
Then, I stopped the namenode and jobtracker and did a format again. I
started the namenode and jobtracker. I could still browse
For those of you who would like to graph the Hadoop JMX variables
with Cacti, I have created Cacti templates and data-input scripts.
Currently the package gathers and graphs the following information
from the NameNode:
Blocks Total
Files Total
Capacity Used/Capacity Free
Live Data Nodes/Dead Data
Hi,
I have a question about how to efficiently access multiple files during the
Reduce phase. The reducer gets a (key, list of values) pair where each key is a
different file and each value represents where to look in the file. The
files are actually .png images.
I have tried using the
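In case it helps, the usual pattern for opening such a side file from inside a reducer looks roughly like this (the path and offset are placeholders; in a real reducer they would come from the key and value being processed, and whether this is efficient for thousands of .png files is a separate question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SideFileRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/images/foo.png")); // placeholder path
    try {
      in.seek(1024L); // jump to the offset carried in the value
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes");
    } finally {
      in.close();
    }
  }
}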
Tamir Kamara wrote:
Hi Raghu,
The thread you posted is my original post written when this problem first
happened on my cluster. I can file a JIRA but I wouldn't be able to provide
information other than what I already posted and I don't have the logs from
that time. Should I still file?
yes.
On Wed, May 6, 2009 at 11:40 AM, Foss User foss...@gmail.com wrote:
Today I formatted the namenode while the namenode and jobtracker were
up. I found that I was still able to browse the file system using the
command: bin/hadoop dfs -lsr /
Then, I stopped the namenode and jobtracker and did a
Is it possible to sort the intermediate values for each key before
the (key, list of values) pair reaches the reducer?
Also, is it possible to sort the final output (key, value) pairs from the
reducer before they are written to HDFS?
Hi, I'm not sure what kind of constraints you are under, specifically why
you wouldn't serve these files up on a (rack local) web server, and mitigate
the overhead of the http request by using more slave nodes. You could skip
the file load step completely that way.
But if you do need to copy files
1. Do the reducers of a job start only after all mappers have finished?
2. Say there are 10 slave nodes, and one of the nodes is very
slow compared to the others. So, while the mappers on the other 9
have finished in 2 minutes, the one on the slow node might take 20
minutes. Is Hadoop
I am developing an MR application with Hadoop that generates a really large
number of output keys during its map phase, and its performance is abysmal.
Just reading the data takes 20 minutes, and processing it without
outputting anything from the map takes around 30 min,
On Thu, May 7, 2009 at 12:44 AM, Todd Lipcon t...@cloudera.com wrote:
On Wed, May 6, 2009 at 11:40 AM, Foss User foss...@gmail.com wrote:
Today I formatted the namenode while the namenode and jobtracker were
up. I found that I was still able to browse the file system using the
command:
I have 2 directories listed for dfs.data.dir and one of them got to 100%
used
during a job I ran. I suspect that's the reason I see this error in the
logs.
Can someone please confirm this?
thanks
Hello,
I am running a compute-intensive job using Hadoop Streaming (Hadoop
version 0.19.1), and my mapper input has several thousand small files.
My system has 4 nodes and 8 cores per node.
I want to run 8 mappers per node to use all 8 cores, but whatever the
mapred.map.tasks value is, I can see
On Wed, May 6, 2009 at 12:22 PM, Foss User foss...@gmail.com wrote:
1. Do the reducers of a job start only after all mappers have finished?
The reducer tasks start so they can begin copying map output, but your
actual reduce function does not. This is because it doesn't know that the
data for
On Wed, May 6, 2009 at 1:10 PM, Seunghwa Kang s.k...@gatech.edu wrote:
Hello,
I am running a compute-intensive job using Hadoop Streaming (Hadoop
version 0.19.1), and my mapper input has several thousand small files.
My system has 4 nodes and 8 cores per node.
I want to run 8 mappers per
On Wed, May 6, 2009 at 12:26 PM, Foss User foss...@gmail.com wrote:
Yes, as far as I remember, but I am not absolutely sure. From your
reply, I understand that what I experienced (maybe due to my own fault) is not
expected behavior. So, if I face the same error again, I would like
to provide more
Hi Tiago,
Here are a couple of thoughts:
1) How much data are you outputting? Obviously there is a certain amount of
IO involved in actually outputting data versus not ;-)
2) Are you using a reduce phase in this job? If so, since you're cutting off
the data at map output time, you're also avoiding
Thanks for your response. I have a few more questions regarding optimizations.
1. Do Hadoop clients locally cache the data they last requested?
2. Is the metadata for file blocks on a data node kept in the
underlying OS's file system on the namenode, or is it kept in the RAM of the
namenode?
3. If no
Jeff,
Thanks for the pointer.
It is pretty clear that Hive and PIG are the same kind of thing and HBase is a
different kind.
The difference between PIG and Hive seems to be pretty insignificant. Layering a
tool on top of them can completely hide their differences.
I am viewing your PIG and Hive tutorial
Thanks Amr,
Without knowing the details of Hive, one constraint of the SQL model is that you can
never generate more than one record from a single record. I don't know how
this is done in Hive. Another question is whether a Hive script can take in
user-defined functions?
Using the following word
Ricky,
For your particular example, Hive allows you to plug in a user-defined map and
reduce script (in the language of your choice) within Hive QL (there are some
minor extensions to SQL to support such a use case). So for your case you could
do the following:
FROM (FROM lines
MAP line
Hi Ricky,
This is how the code will look in Pig.
A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';
Pig training
I still see the memory leak in the JobTracker (version 0.19.0,
streaming, Java version 1.6). Doubling the heap size simply doubled
the time-to-failure. I ran hprof against the jobtracker process.
It appears the Counters objects are instantiated many times. The
stack traces often point
I was just asking because I got to the point where all the map() tasks were
done, and I had configured the cluster to run 3 reduce() tasks, but that was
too much for that machine. Everything was done and only those 3 tasks
needed to complete, but as the 3 were running at the same
time, they would
Thanks for the info!
I was hoping to get some more specific information though. We are seeing these
occur during every run, and as such it's not leaving some folks in our
organization with a good feeling about the reliability of HDFS.
Do these occur as a result of resources being unavailable?
Please try -D dfs.block.size=4096000
The specification must be in bytes.
On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup soett...@nbi.dk
wrote:
Hi all,
I have a job that creates very big local files, so I need to split it across as
many mappers as possible. Now the DFS block
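The dfs.block.size suggestion above, done from Java in case that is easier to test (the output path is a placeholder; note the value is in bytes, exactly as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 4096000L); // same idea as -D dfs.block.size=4096000
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/small-block-file")); // placeholder
    out.writeBytes("hello\n");
    out.close();
  }
}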
There are at least two design choices in Hadoop that have implications for
your scenario.
1. All the HDFS metadata is stored in namenode memory -- the memory size
is one limitation on how many small files you can have.
2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer
Ashish,
Thanks for your code. So the map_script is kind of like a subquery.
Why do I need to use a customized reduce_script in the wordcount example? Can
I just use count(*) group by word?
We cannot assume a fixed explosion factor; a line is a variable-length word
array. Supporting the
Thanks for your response again. I could not understand a few things in
your reply. So, I want to clarify them. Please find my questions
inline.
On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote:
On Wed, May 6, 2009 at 1:46 PM, Foss User foss...@gmail.com wrote:
2. Is the meta
Ricky,
One thing to mention is that SQL support is on the Pig roadmap this year.
--Yiping
On Wed, May 6, 2009 at 9:11 PM, Ricky Ho r...@adobe.com wrote:
Thanks for Olga's example and Scott's comment.
My goal is to pick a higher-level parallel programming language (as an
algorithm design /