On using Eclipse IDE

2009-05-06 Thread George Pang
Dear Users, I configured Eclipse Europa according to the Yahoo tutorial on Hadoop: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html and in the instructions, about creating a new DFS Location, it says: “…Next, click on the “Advanced” tab. There are two settings here which must be

FileSystem.Statistics does not update at once?

2009-05-06 Thread Xie, Tao
I ran an I/O test with the M/R framework. Each mapper writes a 200M file to HDFS. I print the bytesRead and bytesWritten of FileSystem.Statistics every 1000ms, but I see these two values do not update immediately as the M/R job progresses. Does anybody know the reason? Thanks.
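A minimal sketch of polling these counters from Java, assuming a Hadoop release where FileSystem.getAllStatistics() and Statistics.getScheme() are available (older releases only expose the static getStatistics() form). The counters are bumped by the client-side streams as bytes actually move, so they can trail the reported map/reduce progress, which is tracked in records rather than bytes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class StatsPoller {
      public static void main(String[] args) throws Exception {
        // Touch the default filesystem so at least one Statistics entry exists.
        FileSystem.get(new Configuration());
        while (true) {
          for (FileSystem.Statistics s : FileSystem.getAllStatistics()) {
            System.out.println(s.getScheme() + " bytesRead=" + s.getBytesRead()
                + " bytesWritten=" + s.getBytesWritten());
          }
          Thread.sleep(1000);
        }
      }
    }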

Re: PIG and Hive

2009-05-06 Thread asif md
http://www.cloudera.com/hadoop-training-hive-introduction http://www.cloudera.com/hadoop-training-pig-introduction On Wed, May 6, 2009 at 1:17 AM, Ricky Ho r...@adobe.com wrote: Are they competing technologies for providing a higher-level language for Map/Reduce programming? Or are they

RE: On using Eclipse IDE

2009-05-06 Thread Puri, Aseem
George, In my Eclipse Europa it is showing the attribute hadoop.job.ugi. It is after the fs.trash.interval. Thanks Regards Aseem Puri -Original Message- From: George Pang [mailto:p09...@gmail.com] Sent: Wednesday, May 06, 2009 1:07 PM To: core-user@hadoop.apache.org;

Re: move tasks to another machine on the fly

2009-05-06 Thread Tom White
Hi David, The MapReduce framework will attempt to rerun failed tasks automatically. However, if a task is running out of memory on one machine, it's likely to run out of memory on another, isn't it? Have a look at the mapred.child.java.opts configuration property for the amount of memory that
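For reference, a minimal sketch of where that property is set with the old JobConf API; the heap value is purely illustrative (the default in this era was -Xmx200m). Each map or reduce task runs in its own child JVM, so this is the ceiling the task has to fit under on whichever node it lands:

    import org.apache.hadoop.mapred.JobConf;

    public class ChildHeapExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(ChildHeapExample.class);
        // JVM options passed to every map/reduce child; raise -Xmx if tasks run
        // out of memory, keeping (slots per node) * heap within the node's RAM.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        System.out.println(conf.get("mapred.child.java.opts"));
      }
    }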

Is there any performance issue with Jrockit JVM for Hadoop

2009-05-06 Thread Grace
Hi all, This is Grace. I am replacing the Sun JVM with the JRockit JVM for Hadoop, keeping all the same Java options and configuration as with the Sun JVM. However, it is very strange that the performance using the JRockit JVM is poorer than with Sun; for example, the map stage became slower. Has anyone

Re: large files vs many files

2009-05-06 Thread Sasha Dolgy
Hi Tom, Thanks for this. I'll follow that up and see how I get on. At issue is the frequency of the data I have streaming in. Even if I create a new file with a name based on milliseconds I'm still running into the same problems. My thought is that using append, although it's not production

Re: Namenode failed to start with FSNamesystem initialization failed error

2009-05-06 Thread Stas Oskin
Hi. Yes, this was probably it. The strangest part is that HDFS somehow worked even with all the files empty in the NN directory. Go figure... Regards. 2009/5/5 Raghu Angadi rang...@yahoo-inc.com the image is stored in two files: fsimage and edits (under namenode-directory/current/).

Re: move tasks to another machine on the fly

2009-05-06 Thread Steve Loughran
Tom White wrote: Hi David, The MapReduce framework will attempt to rerun failed tasks automatically. However, if a task is running out of memory on one machine, it's likely to run out of memory on another, isn't it? Have a look at the mapred.child.java.opts configuration property for the amount

Small issues regarding hadoop/hbase

2009-05-06 Thread Rakhi Khatwani
Hi, I have a couple of small issues regarding Hadoop/HBase. 1. I want to scan a table, but the table is really huge, so I want to write the result of the scan to a file so that I can analyze it. How do we go about it? 2. How do you dynamically add and remove nodes in the cluster without disturbing the
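For question 1, a rough sketch only (table and column-family names are hypothetical, and this uses the later Scan/ResultScanner client API rather than the 0.19-era one) of streaming scan results into a local file instead of holding them in memory; for a genuinely huge table, a MapReduce job over TableInputFormat scales better:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanToFile {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable"); // hypothetical table
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));            // hypothetical column family
        ResultScanner scanner = table.getScanner(scan);
        PrintWriter out = new PrintWriter(new FileWriter("/tmp/scan-output.txt"));
        try {
          for (Result row : scanner) {
            out.println(Bytes.toString(row.getRow()));  // write whatever you need per row
          }
        } finally {
          out.close();
          scanner.close();
          table.close();
        }
      }
    }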

Using multiple FileSystems in hadoop input

2009-05-06 Thread Ivan Balashov
Greetings to all, Could anyone suggest whether Paths from different FileSystems can be used as input to a Hadoop job? In particular, I'd like to find out whether Paths from HarFileSystem can be mixed with ones from DistributedFileSystem. Thanks, -- Kind regards, Ivan

Re: Using multiple FileSystems in hadoop input

2009-05-06 Thread Tom White
Hi Ivan, I haven't tried this combination, but I think it should work. If it doesn't it should be treated as a bug. Tom On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net wrote: Greetings to all, Could anyone suggest if Paths from different FileSystems can be used as input
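A minimal sketch of the combination being asked about, using the old mapred API and hypothetical paths; because the Paths are fully qualified, an archive path and a plain HDFS path can be listed side by side as one job's input:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class MixedInputPaths {
      public static void main(String[] args) {
        JobConf conf = new JobConf(MixedInputPaths.class);
        // One input lives in a Hadoop archive, the other directly on HDFS;
        // the exact har:// form depends on where the archive is stored.
        FileInputFormat.setInputPaths(conf,
            new Path("har:///user/ivan/logs-2009.har/part-0"),
            new Path("hdfs://namenode:8020/user/ivan/fresh-logs"));
      }
    }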

Re: PIG and Hive

2009-05-06 Thread Sharad Agarwal
See the core-user mail thread with the subject HBase, Hive, Pig and other Hadoop based technologies. - Sharad Ricky Ho wrote: Are they competing technologies for providing a higher-level language for Map/Reduce programming? Or are they complementary? Any comparison between them? Rgds,

Re: multi-line records and file splits

2009-05-06 Thread Tom White
Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code however, since it will read every record that starts within a split (even if it crosses a split

Re: multi-line records and file splits

2009-05-06 Thread Sharad Agarwal
The split doesn't need to be at the record boundary. If a mapper gets a partial record, it will seek to another split to get the full record. - Sharad

RE: Changing output file format and name

2009-05-06 Thread Devika Lakshmanan
Hi, Are we supposed to make changes in OutputFormat? If so, how do we go about it, since it is an interface? If someone has solved this problem, can you kindly mention the steps necessary for the same? Thanks Devika Aruna -Original Message- From: Sharad Agarwal

Re: multi-line records and file splits

2009-05-06 Thread Rajarshi Guha
On May 6, 2009, at 8:22 AM, Tom White wrote: Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code however, since it will read every record that

Re: multi-line records and file splits

2009-05-06 Thread jason hadoop
Hey Tom, I had no luck using the StreamingXmlRecordReader for non-XML files. Are there any parameters that you need to add in? I was testing with 0.19.0. On Wed, May 6, 2009 at 5:25 AM, Sharad Agarwal shara...@yahoo-inc.com wrote: The split doesn't need to be at the record boundary. If a mapper

Re: On using Eclipse IDE

2009-05-06 Thread George Pang
Or, is there a way to find out who the author of that Yahoo tutorial on Hadoop/Eclipse is? Thanks George 2009/5/6 George Pang p09...@gmail.com Dear Users, I configured Eclipse Europa according to the Yahoo tutorial on Hadoop: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html

Compression support for libhdfs

2009-05-06 Thread Leon Mergen
Hello, After examining the libhdfs library, I cannot find any support for compression - is this correct? And, if this is the case, is it also correct that it would be almost trivial to implement in hdfsOpenFile() by making an additional call to one of the compression codecs' createInputStream() /
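libhdfs itself is C, but the Java-side pattern Leon is describing, wrapping the opened HDFS stream with a codec's createInputStream(), looks roughly like this (the path is hypothetical; the codec is picked from the file extension):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadCompressed {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/data/events.gz");  // hypothetical path
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
        // Fall back to the raw stream when the extension matches no codec.
        InputStream in = (codec == null) ? fs.open(p) : codec.createInputStream(fs.open(p));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        try {
          for (String line; (line = reader.readLine()) != null; ) {
            System.out.println(line);
          }
        } finally {
          reader.close();
        }
      }
    }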

Re: Changing output file format and name

2009-05-06 Thread Farhan Husain
Hello, You can subclass the OutputFormat class and write your own. You can look at the code of TextOutputFormat, MultipleOutputFormat, etc. for reference. It might be the case that you only need to make minor changes to one of the existing OutputFormat classes. To do that you can just subclass that
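As one concrete example of the "minor changes to an existing class" route (old mapred API, naming scheme purely illustrative): MultipleTextOutputFormat already does the heavy lifting, and overriding generateFileNameForKeyValue() is enough to control the output file name:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class NamedByKeyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default leaf name (e.g. part-00000); prefix it with the key.
        return key.toString() + "-" + name;
      }
    }

The job then picks it up with conf.setOutputFormat(NamedByKeyOutputFormat.class).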

Folders and files still present after format

2009-05-06 Thread Foss User
Today I formatted the namenode while the namenode and jobtracker were up. I found that I was still able to browse the file system using the command: bin/hadoop dfs -lsr / Then, I stopped the namenode and jobtracker and did a format again. I started the namenode and jobtracker. I could still browse

Cacti Templates for Hadoop

2009-05-06 Thread Edward Capriolo
For those of you that would like to graph the hadoop JMX variables with cacti I have created cacti templates and data input scripts. Currently the package gathers and graphs the following information from the NameNode: Blocks Total Files Total Capacity Used/Capacity Free Live Data Nodes/Dead Data

accessing multiple files in Reducer

2009-05-06 Thread Alan Drew
Hi, I have a question about how to efficiently access multiple files during the Reduce phase. The reducer gets a key, list of values where each key is a different file and the value represents where to look in the file. The files are actually .png images. I have tried using the

Re: Namenode failed to start with FSNamesystem initialization failed error

2009-05-06 Thread Raghu Angadi
Tamir Kamara wrote: Hi Raghu, The thread you posted is my original post written when this problem first happened on my cluster. I can file a JIRA but I wouldn't be able to provide information other than what I already posted and I don't have the logs from that time. Should I still file ? yes.

Re: Folders and files still present after format

2009-05-06 Thread Todd Lipcon
On Wed, May 6, 2009 at 11:40 AM, Foss User foss...@gmail.com wrote: Today I formatted the namenode while the namenode and jobtracker was up. I found that I was still able to browse the file system using the command: bin/hadoop dfs -lsr / Then, I stopped the namenode and jobtracker and did a

Is it possible to sort intermediate values and final values?

2009-05-06 Thread Foss User
Is it possible to sort the intermediate values for each key before the key, list of values pair reaches the reducer? Also, is it possible to sort the final output key, value pairs from the reducer before they are written to HDFS?
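Not an answer from this thread, but for reference: the framework sorts map output by key only, so the simplest option (sketched below with the old mapred API) is for the reducer to buffer and sort each key's values itself. This only works when one key's value list fits in memory; the memory-safe alternative is the "secondary sort" trick of folding the value into the map output key and setting a grouping comparator on the JobConf. The reducer's own output is already written in key order within each reducer.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SortedValuesReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        List<String> sorted = new ArrayList<String>();
        while (values.hasNext()) {
          sorted.add(values.next().toString()); // copy: the iterator reuses its Text object
        }
        Collections.sort(sorted);
        for (String v : sorted) {
          output.collect(key, new Text(v));
        }
      }
    }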

Re: accessing multiple files in Reducer

2009-05-06 Thread Arun Jacob
Hi, I'm not sure what kind of constraints you are under, specifically why you wouldn't serve these files up on a (rack local) web server, and mitigate the overhead of the http request by using more slave nodes. You could skip the file load step completely that way. But if you do need to copy files
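A tiny sketch of the "copy files" branch mentioned above (paths hypothetical): from inside a reduce task an individual HDFS file can be pulled down to the task's local disk and handed to an image library from there; for files every task needs, DistributedCache is the more usual mechanism:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchImage {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy one HDFS file to the local filesystem for local-only processing.
        fs.copyToLocalFile(new Path("/images/tile-0001.png"),
                           new Path("/tmp/tile-0001.png"));
      }
    }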

About Hadoop optimizations

2009-05-06 Thread Foss User
1. Do the reducers of a job start only after all mappers have finished? 2. Say there are 10 slave nodes. Let us say one of the nodes is very slow as compared to other nodes. So, while the mappers in the other 9 have finished in 2 minutes, the one on the slow one might take 20 minutes. Is Hadoop

Large number of map output keys and performance issues.

2009-05-06 Thread Tiago Macambira
I am developing an MR application with Hadoop that generates a really large number of output keys during its map phase, and it is having abysmal performance. While just reading the said data takes 20 minutes, and processing it but not outputting anything from the map takes around 30 min,

Re: Folders and files still present after format

2009-05-06 Thread Foss User
On Thu, May 7, 2009 at 12:44 AM, Todd Lipcon t...@cloudera.com wrote: On Wed, May 6, 2009 at 11:40 AM, Foss User foss...@gmail.com wrote: Today I formatted the namenode while the namenode and jobtracker was up. I found that I was still able to browse the file system using the command:

java.io.IOException: All datanodes are bad. Aborting...

2009-05-06 Thread Mayuran Yogarajah
I have 2 directories listed for dfs.data.dir and one of them got to 100% used during a job I ran. I suspect that's the reason I see this error in the logs. Can someone please confirm this? Thanks

I can see only two mappers per node regardless of mapred.map.tasks value

2009-05-06 Thread Seunghwa Kang
Hello, I am running a compute-intensive job using Hadoop Streaming (Hadoop version 0.19.1), and my mapper input has several thousand small files. My system has 4 nodes and 8 cores per node. I want to run 8 mappers per node to use all 8 cores, but whatever the mapred.map.tasks value is, I can see

Re: About Hadoop optimizations

2009-05-06 Thread Todd Lipcon
On Wed, May 6, 2009 at 12:22 PM, Foss User foss...@gmail.com wrote: 1. Do the reducers of a job start only after all mappers have finished? The reducer tasks start so they can begin copying map output, but your actual reduce function does not. This is because it doesn't know that the data for

Re: I can see only two mappers per node regardless of mapred.map.tasks value

2009-05-06 Thread Todd Lipcon
On Wed, May 6, 2009 at 1:10 PM, Seunghwa Kang s.k...@gatech.edu wrote: Hello, I am running compute intensive job using Hadoop Streaming (hadoop version 0.19.1), and my mapper input has several thousand small files. My system has 4 nodes and 8 cores per node. I want to run 8 mappers per
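Todd's reply is cut off above; the knob normally involved (an assumption here, since it is not visible in the truncated text) is mapred.tasktracker.map.tasks.maximum. It is read by each TaskTracker at startup, so it has to be set in the TaskTracker's own configuration file and the daemons restarted; setting it per job, like mapred.map.tasks, does not change the number of slots. The sketch below just prints the value the local configuration resolves to, and the default of 2 matches the "only two mappers per node" symptom:

    import org.apache.hadoop.mapred.JobConf;

    public class PrintMapSlots {
      public static void main(String[] args) {
        JobConf conf = new JobConf(); // picks up the site configuration from the classpath
        System.out.println("map slots per tasktracker: "
            + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
      }
    }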

Re: Folders and files still present after format

2009-05-06 Thread Todd Lipcon
On Wed, May 6, 2009 at 12:26 PM, Foss User foss...@gmail.com wrote: Yes, as far as I remember but I am not absolutely sure. From your reply, I understand what I experienced (may be due to my fault) is not an expected behavior. So, if I face the same error again I would like to provide more

Re: Large number of map output keys and performance issues.

2009-05-06 Thread Todd Lipcon
Hi Tiago, Here are a couple thoughts: 1) How much data are you outputting? Obviously there is a certain amount of IO involved in actually outputting data versus not ;-) 2) Are you using a reduce phase in this job? If so, since you're cutting off the data at map output time, you're also avoiding

Re: About Hadoop optimizations

2009-05-06 Thread Foss User
Thanks for your response. I have a few more questions regarding optimizations. 1. Do Hadoop clients locally cache the data they last requested? 2. Is the metadata for file blocks on a datanode kept in the underlying OS's file system on the namenode, or is it kept in the RAM of the namenode? 3. If no

RE: PIG and Hive

2009-05-06 Thread Ricky Ho
Jeff, Thanks for the pointer. It is pretty clear that Hive and PIG are the same kind and HBase is a different kind. The difference between PIG and Hive seems to be pretty insignificant. Layering a tool on top of them could completely hide their differences. I am viewing your PIG and Hive tutorial

RE: PIG and Hive

2009-05-06 Thread Ricky Ho
Thanks Amr, Without knowing the details of Hive, one constraint of the SQL model is that you can never generate more than one record from a single record. I don't know how this is done in Hive. Another question is whether the Hive script can take in user-defined functions? Using the following word

RE: PIG and Hive

2009-05-06 Thread Ashish Thusoo
Ricky, For your particular example Hive allows you to plug in a user-defined map and reduce script (in the language of your choice) within Hive QL (there are some minor extensions to SQL to support such a use case). So for your case you could do the following: FROM (FROM lines MAP line

RE: PIG and Hive

2009-05-06 Thread Olga Natkovich
Hi Ricky, This is how the code will look in Pig. A = load 'textdoc' using TextLoader() as (sentence: chararray); B = foreach A generate flatten(TOKENIZE(sentence)) as word; C = group B by word; D = foreach C generate group, COUNT(B); store D into 'wordcount'; Pig training

Re: Job tracker not responding during streaming job

2009-05-06 Thread David Kellogg
I still see the memory leak in the JobTracker (version 0.19.0, streaming, Java 1.6). Doubling the heap size simply doubled the time-to-failure. I ran hprof against the jobtracker process. It appears the Counters objects are instantiated many times. The stack traces often point

Re: move tasks to another machine on the fly

2009-05-06 Thread David Batista
I was just asking because I got to the point where all the map() tasks were done, and I had configured the cluster to run 3 reduce() tasks, but that was too much for that machine. Everything else was done, only those 3 tasks needed to complete, but as the 3 were running at the same time, they would

Re: java.io.EOFException: while trying to read 65557 bytes

2009-05-06 Thread Albert Sunwoo
Thanks for the info! I was hoping to get some more specific information though. We are seeing these occur during every run, and as such it's not leaving some folks in our organization with a good feeling about the reliability of HDFS. Do these occur as a result of resources being unavailable?

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-06 Thread imcaptor
Please try -D dfs.block.size=4096000. The specification must be in bytes. On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup soett...@nbi.dk wrote: Hi all, I have a job that creates very big local files, so I need to split it across as many mappers as possible. Now the DFS block
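The same setting done in code, for reference (values illustrative): dfs.block.size is in bytes, and because the block size is fixed when a file is written it can also be chosen per file at create() time:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallBlockWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size, in bytes, for files created through this conf.
        conf.setLong("dfs.block.size", 4096000L);
        FileSystem fs = FileSystem.get(conf);

        // Or per file: create(path, overwrite, bufferSize, replication, blockSize).
        FSDataOutputStream out = fs.create(new Path("/tmp/small-block-file"),
            true, 4096, (short) 3, 4096000L);
        out.writeUTF("example");
        out.close();
      }
    }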

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-06 Thread Jonathan Cao
There are at least two design choices in Hadoop that have implications for your scenario. 1. All the HDFS metadata is stored in namenode memory -- the memory size is one limitation on how many small files you can have. 2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer
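Jonathan's reply is truncated, but a common mitigation for both points (not necessarily what he goes on to suggest) is to pack many small files into a single SequenceFile, so the namenode tracks one file and each mapper gets a worthwhile amount of input; the paths below are hypothetical:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Key = original file name, value = raw file contents.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/user/demo/packed.seq"), Text.class, BytesWritable.class);
        try {
          for (File f : new File("/data/small-files").listFiles()) {
            byte[] contents = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            try {
              in.readFully(contents);
            } finally {
              in.close();
            }
            writer.append(new Text(f.getName()), new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }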

RE: PIG and Hive

2009-05-06 Thread Ricky Ho
Ashish, Thanks for your code. So the map_script is kind of like a subquery. Why do I need to use a customized reduce_script in the wordcount example? Can I just use count(*) group by word? We cannot assume a fixed explosion factor; a line is a variable-length word array. Supporting the

Re: About Hadoop optimizations

2009-05-06 Thread Foss User
Thanks for your response again. I could not understand a few things in your reply. So, I want to clarify them. Please find my questions inline. On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote: On Wed, May 6, 2009 at 1:46 PM, Foss User foss...@gmail.com wrote: 2. Is the meta

Re: PIG and Hive

2009-05-06 Thread Luc Hunt
Ricky, One thing to mention is that SQL support is on the Pig roadmap this year. --Yiping On Wed, May 6, 2009 at 9:11 PM, Ricky Ho r...@adobe.com wrote: Thanks for Olga's example and Scott's comment. My goal is to pick a higher level parallel programming language (as an algorithm design /